Preface


Intended Audience 


This text is intended for juniors, seniors, or beginning graduate students in statistics, 
mathematics, natural sciences, and engineering as well as for adequately prepared 
students in the social sciences and economics. A year of calculus, including Taylor 
Series and multivariable calculus, and an introductory course in linear algebra are 
prerequisites. 


This Book's Objectives 


This book reflects my view of what a first, and for many students a last, course in 
statistics should be. Such a course should include some traditional topics in mathe- 
matical statistics (such as methods based on likelihood), topics in descriptive statistics 
and data analysis with special attention to graphical displays, aspects of experimental 
design, and realistic applications of some complexity. It should also reflect the inte- 
gral role played by computers in statistics. These themes, properly interwoven, can 
give students a view of the nature of modern statistics. The alternative of teaching 
two separate courses, one on theory and one on data analysis, seems to me artificial. 
Furthermore, many students take only one course in statistics and do not have time 
for two or more. 


Analysis of Data and the Practice 
of Statistics 


In order to draw the above themes together, I have endeavored to write a book closely 
tied to the practice of statistics. It is in the analysis of real data that one sees the roles 
played by both formal theory and informal data analytic methods. I have organized 
this book around various kinds of problems that entail the use of statistical methods 
and have included many real examples to motivate and introduce the theory. Among 
the advantages of such an approach are that theoretical constructs are presented in
meaningful contexts, that they are gradually supplemented and reinforced, and that 
they are integrated with more informal methods. This is, I think, a fitting approach 
to statistics, the historical development of which has been spurred on primarily by 
practical needs rather than by abstract or aesthetic considerations. At the same time, 
I have not shied away from using the mathematics that the students are supposed to 
know. 


The Third Edition 


Eighteen years have passed since the first edition of this book was published and 
eleven years since the second. Although the basic intent and structure of the book
have not changed, the new editions reflect developments in the discipline of statistics, 
primarily the computational revolution. 

The most significant change in this edition is the treatment of Bayesian infer- 
ence. I moved the material from the last chapter, a point that was never reached by 
many instructors, and integrated it into earlier chapters. Bayesian inference is now 
first previewed in Chapter 3, in the context of conditional distributions. It is then 
placed side-by-side with frequentist methods in Chapter 8, where it complements the 
material on maximum likelihood estimation very naturally. The introductory section 
on hypothesis testing in Chapter 9 now begins with a Bayesian formulation before 
moving on to the Neyman-Pearson paradigm. One advantage of this is that the funda- 
mental importance of the likelihood ratio is now much more apparent. In applications, 
I stress uninformative priors and show the similarity of the qualitative conclusions 
that would be reached by frequentist and Bayesian methods. 

Other new material includes the use of examples from genomics and financial 
statistics in the probability chapters. In addition to its value as topically relevant, this 
material naturally reinforces basic concepts. For example, the material on copulas 
underscores the relationships of marginal and joint distributions. Other changes in- 
clude the introduction of scatterplots and correlation coefficients within the context 
of exploratory data analysis in Chapter 10 and a brief introduction to nonparametric 
smoothing via local linear least squares in Chapter 14. There are nearly 100 new 
problems, mainly in Chapters 7-14, including several new data sets. Some of the data 
sets are sufficiently substantial to be the basis for computer lab assignments. I also 
elucidated many passages that were obscure in earlier editions. 


Brief Outline 


A complete outline can be found, of course, in the Table of Contents. Here I will just 
highlight some points and indicate various curricular options for the instructor. 

The first six chapters contain an introduction to probability theory, particularly 
those aspects most relevant to statistics. Chapter 1 introduces the basic ingredients 
of probability theory and elementary combinatorial methods from a non-measure-theoretic
point of view. In this and the other probability chapters, I tried to use
real-world examples rather than balls and urns whenever possible.

The concept of a random variable is introduced in Chapter 2. I chose to discuss
discrete and continuous random variables together, instead of putting off the contin- 
uous case until later. Several common distributions are introduced. An advantage of 
this approach is that it provides something to work with and develop in later chapters. 

Chapter 3 continues the treatment of random variables by going into joint dis- 
tributions. The instructor may wish to skip lightly over Jacobians; this can be done 
with little loss of continuity, since they are rarely used in the rest of the book. The 
material in Section 3.7 on extrema and order statistics can be omitted if the instructor 
is willing to do a little backtracking later. 

Expectation, variance, covariance, conditional expectation, and moment-gene- 
rating functions are taken up in Chapter 4. The instructor may wish to pass lightly 
over conditional expectation and prediction, especially if he or she does not plan to 
cover sufficiency later. The last section of this chapter introduces the δ method, or
the method of propagation of error. This method is used several times in the statistics 
chapters. 

The law of large numbers and the central limit theorem are proved in Chapter 5 
under fairly strong assumptions. 

Chapter 6 is a compendium of the common distributions related to the normal and
sampling distributions of statistics computed from the usual normal random sample. 
I don't spend a lot of time on this material here but do develop the necessary facts 
as they are needed in the statistics chapters. It is useful for students to have these 
distributions collected in one place. 

Chapter 7 is on survey sampling, an unconventional, but in some ways natural, 
beginning to the study of statistics. Survey sampling is an area of statistics with 
which most students have some vague familiarity, and a set of fairly specific, concrete 
statistical problems can be naturally posed. It is a context in which, historically, many 
important statistical concepts have developed, and it can be used as a vehicle for 
introducing concepts and techniques that are developed further in later chapters, for 
example: 


* The idea of an estimate as a random variable with an associated sampling distribution

* The concepts of bias, standard error, and mean squared error 

* Confidence intervals and the application of the central limit theorem 

* An exposure to notions of experimental design via the study of stratified estimates 
and the concept of relative efficiency 

* Calculation of expectations, variances, and covariances 


One of the unattractive aspects of survey sampling is that the calculations are rather 
grubby. However, there is a certain virtue in this grubbiness, and students are given 
practice in such calculations. The instructor has quite a lot of flexibility as to how 
deeply to cover the concepts in this chapter. The sections on ratio estimation and 
stratification are optional and can be skipped entirely or returned to at a later time 
without loss of continuity. 

Chapter 8 is concerned with parameter estimation, a subject that is motivated 
and illustrated by the problem of fitting probability laws to data. The method of 
moments, the method of maximum likelihood, and Bayesian inference are developed. 
The concept of efficiency is introduced, and the Cramér-Rao Inequality is proved. 
Section 8.8 introduces the concept of sufficiency and some of its ramifications. The
material on the Cramér-Rao lower bound and on sufficiency can be skipped; to my
mind, the importance of sufficiency is usually overstated. Section 8.7.1 (the negative 
binomial distribution) can also be skipped. 

Chapter 9 is an introduction to hypothesis testing with particular application 
to testing for goodness of fit, which ties in with Chapter 8. (This subject is further 
developed in Chapter 11.) Informal, graphical methods are presented here as well. 
Several of the last sections of this chapter can be skipped if the instructor is pressed 
for time. These include Section 9.6 (the Poisson dispersion test), Section 9.7 (hanging 
rootograms), and Section 9.9 (tests for normality). 

A variety of descriptive methods are introduced in Chapter 10. Many of these 
techniques are used in later chapters. The importance of graphical procedures is 
stressed, and notions of robustness are introduced. The placement of a chapter on 
descriptive methods this late in a book may seem strange. I chose to do so be- 
cause descriptive procedures usually have a stochastic side and, having been through 
the three chapters preceding this one, students are by now better equipped to study the 
statistical behavior of various summary statistics (for example, a confidence interval 
for the median). When I teach the course, I introduce some of this material earlier. 
For example, I have students make boxplots and histograms from samples drawn in 
labs on survey sampling. If the instructor wishes, the material on survival and hazard 
functions can be skipped. 

Classical and nonparametric methods for two-sample problems are introduced 
in Chapter 11. The concepts of hypothesis testing, first introduced in Chapter 9, 
are further developed. The chapter concludes with some discussion of experimental 
design and the interpretation of observational studies. 

The first eleven chapters are the heart of an introductory course; the theoretical 
constructs of estimation and hypothesis testing have been developed, graphical and 
descriptive methods have been introduced, and aspects of experimental design have 
been discussed. 

The instructor has much more freedom in selecting material from Chapters 12 
through 14. In particular, it is not necessary to proceed through these chapters in the 
order in which they are presented. 

Chapter 12 treats the one-way and two-way layouts via analysis of variance and 
nonparametric techniques. The problem of multiple comparisons, first introduced at 
the end of Chapter 11, is discussed. 

Chapter 13 is a rather brief treatment of the analysis of categorical data. Likeli- 
hood ratio tests are developed for homogeneity and independence. McNemar's test 
is presented and finally, estimation of the odds ratio is motivated by a discussion of 
prospective and retrospective studies. 

Chapter 14 concerns linear least squares. Simple linear regression is developed 
first and is followed by a more general treatment using linear algebra. I chose to 
employ matrix algebra but keep the level of the discussion as simple and concrete as 
possible, not going beyond concepts typically taught in an introductory one-quarter 
course. In particular, I did not develop a geometric analysis of the general linear model 
or make any attempt to unify regression and analysis of variance. Throughout this 
chapter, theoretical results are balanced by more qualitative data analytic procedures 
based on analysis of residuals. At the end of the chapter, I introduce nonparametric 
regression via local linear least squares. 


Computer Use and Problem Solving


Computation is an integral part of contemporary statistics. It is essential for data 
analysis and can be an aid to clarifying basic concepts. My students use the open- 
source package R, which they can install on their own computers. Other packages 
could be used as well but I do not discuss any particular programs in the text. The 
data in the text are available on the CD that is bound in the U.S. edition or can be 
downloaded from www.thomsonedu.com/statistics. 

This book contains a large number of problems, ranging from routine reinforce- 
ment of basic concepts to some that students will find quite difficult. I think that 
problem solving, especially of nonroutine problems, is very important. 


CHAPTER 1


Probability 


Introduction 


The idea of probability, chance, or randomness is quite old, whereas its rigorous 
axiomatization in mathematical terms occurred relatively recently. Many of the ideas 
of probability theory originated in the study of games of chance. In this century, the 
mathematical theory of probability has been applied to a wide variety of phenomena; 
the following are some representative examples: 


* Probability theory has been used in genetics as a model for mutations and ensuing
natural variability, and plays a central role in bioinformatics.

* The kinetic theory of gases has an important probabilistic component.

* In designing and analyzing computer operating systems, the lengths of various
queues in the system are modeled as random phenomena.

* There are highly developed theories that treat noise in electrical devices and
communication systems as random processes.

* Many models of atmospheric turbulence use concepts of probability theory.

* In operations research, the demands on inventories of goods are often modeled as
random.

* Actuarial science, which is used by insurance companies, relies heavily on the tools
of probability theory.

* Probability theory is used to study complex systems and improve their reliability,
such as in modern commercial or military aircraft.

* Probability theory is a cornerstone of the theory of finance.


The list could go on and on. 

This book develops the basic ideas of probability and statistics. The first part 
explores the theory of probability as a mathematical model for chance phenomena. 
The second part of the book is about statistics, which is essentially concerned with 
procedures for analyzing data, especially data that in some vague sense have a random
character. To comprehend the theory of statistics, you must have a sound background
in probability.


1.2 Sample Spaces


Probability theory is concerned with situations in which the outcomes occur randomly.
Generically, such situations are called experiments, and the set of all possible outcomes
is the sample space corresponding to an experiment. The sample space is denoted by
$\Omega$, and an element of $\Omega$ is denoted by $\omega$. The following are some examples.


EXAMPLE A

Driving to work, a commuter passes through a sequence of three intersections with
traffic lights. At each light, she either stops, s, or continues, c. The sample space is
the set of all possible outcomes:


$$\Omega = \{ccc, ccs, css, csc, sss, ssc, scc, scs\}$$


where csc, for example, denotes the outcome that the commuter continues through
the first light, stops at the second light, and continues through the third light. ■


EXAMPLE B

The number of jobs in a print queue of a mainframe computer may be modeled as
random. Here the sample space can be taken as


$$\Omega = \{0, 1, 2, 3, \ldots\}$$


that is, all the nonnegative integers. In practice, there is probably an upper limit, N,
on how large the print queue can be, so instead the sample space might be defined as


$$\Omega = \{0, 1, 2, \ldots, N\}$$
■


EXAMPLE C

Earthquakes exhibit very erratic behavior, which is sometimes modeled as random.
For example, the length of time between successive earthquakes in a particular region
that are greater in magnitude than a given threshold may be regarded as an experiment.
Here $\Omega$ is the set of all nonnegative real numbers:


$$\Omega = \{t \mid t \geq 0\}$$
■


We are often interested in particular subsets of $\Omega$, which in probability language
are called events. In Example A, the event that the commuter stops at the first light is
the subset of $\Omega$ denoted by


$$A = \{sss, ssc, scc, scs\}$$


(Events, or subsets, are usually denoted by italic uppercase letters.) In Example B, 
the event that there are fewer than five jobs in the print queue can be denoted by 


$$A = \{0, 1, 2, 3, 4\}$$


The algebra of set theory carries over directly into probability theory. The union
of two events, A and B, is the event C that either A occurs or B occurs or both occur:
$C = A \cup B$. For example, if A is the event that the commuter stops at the first light
(listed before), and if B is the event that she stops at the third light,


$$B = \{sss, scs, ccs, css\}$$


then C is the event that she stops at the first light or stops at the third light and consists
of the outcomes that are in A or in B or in both:


$$C = \{sss, ssc, scc, scs, ccs, css\}$$


The intersection of two events, $C = A \cap B$, is the event that both A and B occur.
If A and B are as given previously, then C is the event that the commuter stops at
the first light and stops at the third light and thus consists of those outcomes that are
common to both A and B:


$$C = \{sss, scs\}$$


The complement of an event, $A^c$, is the event that A does not occur and thus
consists of all those elements in the sample space that are not in A. The complement
of the event that the commuter stops at the first light is the event that she continues at
the first light:


$$A^c = \{ccc, ccs, css, csc\}$$


You may recall from previous exposure to set theory a rather mysterious set called
the empty set, usually denoted by $\emptyset$. The empty set is the set with no elements; it
is the event with no outcomes. For example, if A is the event that the commuter
stops at the first light and C is the event that she continues through all three lights,
$C = \{ccc\}$, then A and C have no outcomes in common, and we can write


$$A \cap C = \emptyset$$


In such cases, A and C are said to be disjoint. 
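
The set operations described above map directly onto Python's built-in `set` type. The following is a minimal sketch for the traffic-light example; the variable names (`omega`, `all_go`, and so on) are illustrative choices, not notation from the text.

```python
from itertools import product

# Sample space: every sequence of stop (s) / continue (c) at three lights.
omega = {"".join(lights) for lights in product("cs", repeat=3)}

A = {w for w in omega if w[0] == "s"}   # stops at the first light
B = {w for w in omega if w[2] == "s"}   # stops at the third light
all_go = {"ccc"}                        # continues through all three lights

union = A | B                # A or B (or both) occurs
intersection = A & B         # both A and B occur
complement_A = omega - A     # A does not occur
disjoint = A.isdisjoint(all_go)  # True when the events share no outcomes

print(sorted(union))
print(sorted(intersection))
print(sorted(complement_A))
print(disjoint)
```

Building the sample space with `itertools.product` rather than typing it out guards against omitting an outcome.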

Venn diagrams, such as those in Figure 1.1, are often a useful tool for visualizing 
set operations. 

The following are some laws of set theory. 


Commutative Laws:

$$A \cup B = B \cup A$$
$$A \cap B = B \cap A$$

Associative Laws:

$$(A \cup B) \cup C = A \cup (B \cup C)$$
$$(A \cap B) \cap C = A \cap (B \cap C)$$

Distributive Laws:

$$(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$$
$$(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$$


Of these, the distributive laws are the least intuitive, and you may find it instructive 
to illustrate them with Venn diagrams. 
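
Besides drawing Venn diagrams, one can check these laws by brute force on a small universe. The sketch below tests every triple of subsets of a three-element set; this is a sanity check on one small case, not a proof, and the helper `all_subsets` is a hypothetical name introduced here.

```python
from itertools import combinations

def all_subsets(universe):
    """Return every subset of `universe` as a frozenset."""
    items = list(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

subsets = all_subsets({1, 2, 3})

commutative_ok = all(
    A | B == B | A and A & B == B & A
    for A in subsets for B in subsets
)
associative_ok = all(
    (A | B) | C == A | (B | C) and (A & B) & C == A & (B & C)
    for A in subsets for B in subsets for C in subsets
)
distributive_ok = all(
    (A | B) & C == (A & C) | (B & C) and (A & B) | C == (A | C) & (B | C)
    for A in subsets for B in subsets for C in subsets
)

print(commutative_ok, associative_ok, distributive_ok)
```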


FIGURE 1.1 Venn diagrams of $A \cup B$ and $A \cap B$.


Probability Measures 


A probability measure on $\Omega$ is a function $P$ from subsets of $\Omega$ to the real numbers
that satisfies the following axioms:


1. $P(\Omega) = 1$.
2. If $A \subset \Omega$, then $P(A) \geq 0$.
3. If $A_1$ and $A_2$ are disjoint, then


$$P(A_1 \cup A_2) = P(A_1) + P(A_2).$$


More generally, if $A_1, A_2, \ldots, A_n, \ldots$ are mutually disjoint, then


$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$


The first two axioms are obviously desirable. Since $\Omega$ consists of all possible
outcomes, $P(\Omega) = 1$. The second axiom simply states that a probability is nonnegative.
The third axiom states that if A and B are disjoint (that is, have no outcomes in
common), then $P(A \cup B) = P(A) + P(B)$ and also that this property extends to
limits. For example, the probability that the print queue contains either one or three
jobs is equal to the probability that it contains one plus the probability that it contains
three.
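
On a finite sample space, a probability measure can be represented by assigning a nonnegative weight to each outcome and summing over the outcomes in an event. The sketch below does this for a print queue that holds at most three jobs; the specific probabilities are made up for illustration and do not come from the text.

```python
from fractions import Fraction

# Hypothetical outcome probabilities for a queue holding 0..3 jobs
# (illustrative numbers only).
p = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}
omega = set(p)

def P(event):
    """Probability of an event (a subset of omega)."""
    return sum((p[w] for w in event), Fraction(0))

axiom1 = P(omega) == 1                      # total probability is 1
axiom2 = all(p[w] >= 0 for w in omega)      # nonnegative weights
one_job, three_jobs = {1}, {3}              # disjoint events
axiom3 = P(one_job | three_jobs) == P(one_job) + P(three_jobs)

print(axiom1, axiom2, axiom3, P(one_job | three_jobs))
```

Using `Fraction` keeps the arithmetic exact, so the equality checks are not affected by floating-point rounding.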

The following properties of probability measures are consequences of the axioms. 


Property A $P(A^c) = 1 - P(A)$. This property follows since A and $A^c$ are disjoint
with $A \cup A^c = \Omega$ and thus, by the first and third axioms, $P(A) + P(A^c) = 1$. In
words, this property says that the probability that an event does not occur equals one
minus the probability that it does occur.


Property B $P(\emptyset) = 0$. This property follows from Property A since $\emptyset = \Omega^c$. In
words, this says that the probability that there is no outcome at all is zero.


Property C If $A \subset B$, then $P(A) \leq P(B)$. This property states that if B occurs
whenever A occurs, then $P(A) \leq P(B)$. For example, if whenever it rains (A) it is
cloudy (B), then the probability that it rains is less than or equal to the probability
that it is cloudy. Formally, it can be proved as follows: B can be expressed as the
union of two disjoint sets:


$$B = A \cup (B \cap A^c)$$

Then, from the third axiom,

$$P(B) = P(A) + P(B \cap A^c)$$

and thus, since $P(B \cap A^c) \geq 0$ by the second axiom,

$$P(A) = P(B) - P(B \cap A^c) \leq P(B)$$


Property D Addition Law $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. This property
is easy to see from the Venn diagram in Figure 1.2. If $P(A)$ and $P(B)$ are added
together, $P(A \cap B)$ is counted twice. To prove it, we decompose $A \cup B$ into three
disjoint subsets, as shown in Figure 1.2:


$$C = A \cap B^c$$
$$D = A \cap B$$
$$E = A^c \cap B$$


FIGURE 1.2 Venn diagram illustrating the addition law.


We then have, from the third axiom,

$$P(A \cup B) = P(C) + P(D) + P(E)$$


Also, $A = C \cup D$, and C and D are disjoint; so $P(A) = P(C) + P(D)$. Similarly,
$P(B) = P(D) + P(E)$. Putting these results together, we see that


$$P(A) + P(B) = P(C) + P(E) + 2P(D) = P(A \cup B) + P(D)$$


or


$$P(A \cup B) = P(A) + P(B) - P(D)$$


EXAMPLE A


Suppose that a fair coin is thrown twice. Let A denote the event of heads on the first
toss, and let B denote the event of heads on the second toss. The sample space is


$$\Omega = \{hh, ht, th, tt\}$$


We assume that each elementary outcome in $\Omega$ is equally likely and has probability
$\frac{1}{4}$. $C = A \cup B$ is the event that heads comes up on the first toss or on the second toss.
Clearly, $P(C) \neq P(A) + P(B) = 1$. Rather, since $A \cap B$ is the event that heads
comes up on the first toss and on the second toss,


$$P(C) = P(A) + P(B) - P(A \cap B) = .5 + .5 - .25 = .75$$
■
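
The addition law can be checked for the two-toss example by enumerating the sample space. This is a small sketch assuming equally likely outcomes, as in the example; the helper name `prob` is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses: hh, ht, th, tt (all equally likely).
omega = ["".join(t) for t in product("ht", repeat=2)]

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "h"}   # heads on the first toss
B = {w for w in omega if w[1] == "h"}   # heads on the second toss

p_union_direct = prob(A | B)                        # count outcomes in A ∪ B
p_union_law = prob(A) + prob(B) - prob(A & B)       # addition law

print(p_union_direct, p_union_law, prob(A) + prob(B))
```

The direct count and the addition law agree, while the naive sum `prob(A) + prob(B)` double-counts the outcome hh.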


EXAMPLE B

An article in the Los Angeles Times (August 24, 1987) discussed the statistical risks
of AIDS infection: 


Several studies of sexual partners of people infected with the virus show that 
a single act of unprotected vaginal intercourse has a surprisingly low risk of 
infecting the uninfected partner—perhaps one in 100 to one in 1000. For an 
average, consider the risk to be one in 500. If there are 100 acts of intercourse 
with an infected partner, the odds of infection increase to one in five. 

Statistically, 500 acts of intercourse with one infected partner or 100 acts 
with five partners lead to a 100% probability of infection (statistically, not 
necessarily in reality). 


Following this reasoning, 1000 acts of intercourse with one infected partner would
lead to a probability of infection equal to 2 (statistically, but not necessarily in reality).
To see the flaw in the reasoning that leads to this conclusion, consider two acts of
intercourse. Let $A_1$ denote the event that infection occurs on the first act and let $A_2$
denote the event that infection occurs on the second act. Then the event that infection
occurs is $B = A_1 \cup A_2$ and


$$P(B) = P(A_1) + P(A_2) - P(A_1 \cap A_2) \leq P(A_1) + P(A_2) = \frac{2}{500}$$
■
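
To see numerically how far the newspaper's summation can overshoot, the sketch below compares the naive sum with one possible exact calculation. The exact calculation rests on an additional assumption that is not made in this section: that the acts are independent, each with infection probability 1/500. Under that assumption the probability of at least one infection in n acts is $1 - (1 - p)^n$. The function names are hypothetical.

```python
def naive_sum(p, n):
    """The article's reasoning: add the per-act probabilities."""
    return n * p

def at_least_once_independent(p, n):
    """P(at least one occurrence in n acts), ASSUMING independent acts."""
    return 1.0 - (1.0 - p) ** n

p = 1 / 500
for n in (100, 500, 1000):
    print(n, naive_sum(p, n), round(at_least_once_independent(p, n), 3))
```

Whatever the dependence between acts, the addition law shows that the sum is only an upper bound; the independence calculation simply gives one concrete value that stays below it and below 1.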


Computing Probabilities: 
Counting Methods 


Probabilities are especially easy to compute for finite sample spaces. Suppose that
$\Omega = \{\omega_1, \omega_2, \ldots, \omega_N\}$ and that $P(\omega_i) = p_i$. To find the probability of an event A,
we simply add the probabilities of the $\omega_i$ that constitute A.


EXAMPLE A


Suppose that a fair coin is thrown twice and the sequence of heads and tails is recorded.
The sample space is


$$\Omega = \{hh, ht, th, tt\}$$


As in Example A of the previous section, we assume that each outcome in $\Omega$ has
probability .25. Let A denote the event that at least one head is thrown. Then
$A = \{hh, ht, th\}$, and $P(A) = .75$. ■


This is a simple example of a fairly common situation. The elements of $\Omega$ all
have equal probability; so if there are N elements in $\Omega$, each of them has probability
$1/N$. If A can occur in any of n mutually exclusive ways, then $P(A) = n/N$, or


$$P(A) = \frac{\text{number of ways } A \text{ can occur}}{\text{total number of outcomes}}$$


Note that this formula holds only if all the outcomes are equally likely. In Example A,
if only the number of heads were recorded, then $\Omega$ would be $\{0, 1, 2\}$. These
outcomes are not equally likely, and $P(A)$ is not $\frac{2}{3}$. ■


EXAMPLE B


Simpson's Paradox

A black urn contains 5 red and 6 green balls, and a white urn contains 3 red and 4
green balls. You are allowed to choose an urn and then choose a ball at random from
the urn. If you choose a red ball, you get a prize. Which urn should you choose to
draw from? If you draw from the black urn, the probability of choosing a red ball is
$\frac{5}{11} = .455$ (the number of ways you can draw a red ball divided by the total number
of outcomes). If you choose to draw from the white urn, the probability of choosing
a red ball is $\frac{3}{7} = .429$, so you should choose to draw from the black urn.

Now consider another game in which a second black urn has 6 red and 3 green
balls, and a second white urn has 9 red and 5 green balls. If you draw from the black
urn, the probability of a red ball is $\frac{6}{9} = .667$, whereas if you choose to draw from the
white urn, the probability is $\frac{9}{14} = .643$. So, again you should choose to draw from
the black urn.

In the final game, the contents of the second black urn are added to the first black
urn, and the contents of the second white urn are added to the first white urn. Again,
you can choose which urn to draw from. Which should you choose? Intuition says
choose the black urn, but let's calculate the probabilities. The black urn now contains
11 red and 9 green balls, so the probability of drawing a red ball from it is $\frac{11}{20} = .55$.
The white urn now contains 12 red and 9 green balls, so the probability of drawing a red
ball from it is $\frac{12}{21} = .571$. So, you should choose the white urn. This counterintuitive
result is an example of Simpson's paradox. For an example that occurred in real life,
see Section 11.4.7. For more amusing examples, see Gardner (1976). ■
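
The reversal in the urn games can be reproduced with a few lines of exact arithmetic. The sketch below uses the ball counts from the example; the data layout and the helper names `p_red` and `pool` are illustrative choices.

```python
from fractions import Fraction

# (red, green) ball counts for each urn in the first two games.
game1 = {"black": (5, 6), "white": (3, 4)}
game2 = {"black": (6, 3), "white": (9, 5)}

def p_red(urn):
    """Probability of drawing a red ball from an urn given (red, green)."""
    red, green = urn
    return Fraction(red, red + green)

def pool(u, v):
    """Combine the contents of two urns."""
    return (u[0] + v[0], u[1] + v[1])

# Final game: pour the second urn of each color into the first.
combined = {color: pool(game1[color], game2[color]) for color in game1}

for name, game in [("game 1", game1), ("game 2", game2), ("combined", combined)]:
    best = max(game, key=lambda color: p_red(game[color]))
    print(name, {c: round(float(p_red(u)), 3) for c, u in game.items()}, "best:", best)
```

The black urn is the better choice in each separate game, yet the white urn is better once the contents are pooled, because the two games contribute very different numbers of balls to each pooled urn.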


In the preceding examples, it was easy to count the number of outcomes and 
calculate probabilities. To compute probabilities for more complex situations, we 
must develop systematic ways of counting outcomes, which are the subject of the 
next two sections. 


1.4.1 The Multiplication Principle


The following is a statement of the very useful multiplication principle. 


MULTIPLICATION PRINCIPLE


If one experiment has m outcomes and another experiment has n outcomes, then 
there are mn possible outcomes for the two experiments. 


Proof

Denote the outcomes of the first experiment by $a_1, \ldots, a_m$ and the outcomes of
the second experiment by $b_1, \ldots, b_n$. The outcomes for the two experiments are
the ordered pairs $(a_i, b_j)$. These pairs can be exhibited as the entries of an $m \times n$
rectangular array, in which the pair $(a_i, b_j)$ is in the ith row and the jth column.
There are mn entries in this array. ■


EXAMPLE A

Playing cards have 13 face values and 4 suits. There are thus $4 \times 13 = 52$
face-value/suit combinations. ■


EXAMPLE B

A class has 12 boys and 18 girls. The teacher selects 1 boy and 1 girl to act as
representatives to the student government. She can do this in any of $12 \times 18 = 216$
different ways. ■
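
The rectangular-array argument in the proof corresponds to forming a Cartesian product, which `itertools.product` does directly. The sketch below counts the ordered pairs for the card and class examples; the labels used for face values, suits, and students are placeholders.

```python
from itertools import product

faces = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]

# Every (face value, suit) pair: one entry per cell of a 13 x 4 array.
deck = list(product(faces, suits))

# Placeholder labels for 12 boys and 18 girls.
boys = [f"boy{i}" for i in range(12)]
girls = [f"girl{j}" for j in range(18)]
representative_pairs = list(product(boys, girls))

print(len(deck), len(representative_pairs))
```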


EXTENDED MULTIPLICATION PRINCIPLE 


If there are p experiments and the first has $n_1$ possible outcomes, the second
$n_2$, ..., and the pth $n_p$ possible outcomes, then there are a total of
$n_1 \times n_2 \times \cdots \times n_p$ possible outcomes for the p experiments.


Proof


This principle can be proved from the multiplication principle by induction.
We saw that it is true for $p = 2$. Assume that it is true for $p = q$; that is, that
there are $n_1 \times n_2 \times \cdots \times n_q$ possible outcomes for the first q experiments. To
complete the proof by induction, we must show that it follows that the property
holds for $p = q + 1$. We apply the multiplication principle, regarding
the first q experiments as a single experiment with $n_1 \times \cdots \times n_q$ outcomes,
and conclude that there are $(n_1 \times \cdots \times n_q) \times n_{q+1}$ outcomes for the $q + 1$
experiments. ■


EXAMPLE C

An 8-bit binary word is a sequence of 8 digits, of which each may be either a 0 or a 1.
How many different 8-bit words are there?
There are two choices for the first bit, two for the second, etc., and thus there are


$$2 \times 2 \times 2 \times 2 \times 2 \times 2 \times 2 \times 2 = 2^8 = 256$$


such words. ■
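
The extended multiplication principle can be checked by listing outcomes explicitly when the numbers are small. The following sketch enumerates all 8-bit words and compares a direct count against the product of the individual counts; `count_outcomes` is a hypothetical helper introduced for illustration.

```python
from itertools import product
from math import prod

def count_outcomes(*experiments):
    """Count combined outcomes by listing every tuple of individual outcomes."""
    return sum(1 for _ in product(*experiments))

# All sequences of 8 binary digits.
words = ["".join(bits) for bits in product("01", repeat=8)]

# Three experiments with 2, 3, and 4 outcomes respectively.
sizes = [2, 3, 4]
enumerated = count_outcomes(*(range(n) for n in sizes))

print(len(words), enumerated, prod(sizes))
```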


EXAMPLE D

A DNA molecule is a sequence of four types of nucleotides, denoted by A, G, C, and T.
The molecule can be millions of units long and can thus encode an enormous amount
of information. For example, for a molecule 1 million ($10^6$) units long, there are
$4^{10^6}$ different possible sequences. This is a staggeringly large number having roughly
600,000 digits. An amino acid is coded for by a sequence of three nucleotides; there are
$4^3 = 64$ different codes, but there are only 20 amino acids since some of them can be
coded for in several ways. A protein molecule is composed of as many as hundreds of
amino acid units, and thus there are an incredibly large number of possible proteins.
For example, there are $20^{100}$ different sequences of 100 amino acids. ■


1.4.2 Permutations and Combinations


A permutation is an ordered arrangement of objects. Suppose that from the set 
$C = \{c_1, c_2, \ldots, c_n\}$ we choose r elements and list them in order. How many ways
can we do this? The answer depends on whether we are allowed to duplicate items 
in the list. If no duplication is allowed, we are sampling without replacement. If 
duplication is allowed, we are sampling with replacement. We can think of the 
problem as that of taking labeled balls from an urn. In the first type of sampling, we 
are not allowed to put a ball back before choosing the next one, but in the second, we 
are. In either case, when we are done choosing, we have a list of r balls ordered in 
the sequence in which they were drawn. 

The extended multiplication principle can be used to count the number of different
ordered samples possible for a set of n elements. First, suppose that sampling is done
with replacement. The first ball can be chosen in any of n ways, the second in any
of n ways, etc., so that there are $n \times n \times \cdots \times n = n^r$ samples. Next, suppose that
sampling is done without replacement. There are n choices for the first ball, $n - 1$
choices for the second ball, $n - 2$ for the third, ..., and $n - r + 1$ for the rth. We
have just proved the following proposition.


PROPOSITION A 


For a set of size n and a sample of size r, there are $n^r$ different ordered samples
with replacement and $n(n - 1)(n - 2) \cdots (n - r + 1)$ different ordered
samples without replacement. ■


COROLLARY A


The number of orderings of n elements is $n(n - 1)(n - 2) \cdots 1 = n!$. ■
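
The two counts in Proposition A can be compared with a direct enumeration for small n and r. The sketch below uses `math.perm` (available in Python 3.8 and later) for the falling product and `itertools` to list the samples; the wrapper function names are illustrative.

```python
from itertools import permutations, product
from math import factorial, perm

def ordered_with_replacement(n, r):
    """Number of ordered samples of size r from n items, with replacement."""
    return n ** r

def ordered_without_replacement(n, r):
    """n(n-1)...(n-r+1): ordered samples of size r without replacement."""
    return perm(n, r)

n, r = 5, 3
items = range(n)
listed_with = sum(1 for _ in product(items, repeat=r))
listed_without = sum(1 for _ in permutations(items, r))

print(ordered_with_replacement(n, r), listed_with)
print(ordered_without_replacement(n, r), listed_without)
print(ordered_without_replacement(n, n), factorial(n))  # Corollary A: n!
```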


How many ways can five children be lined up?
This corresponds to sampling without replacement. According to Corollary A,
there are $5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$ different lines. ■


Suppose that from ten children, five are to be chosen and lined up. How many different
lines are possible?
From Proposition A, there are $10 \times 9 \times 8 \times 7 \times 6 = 30{,}240$ different lines. ■


In some states, license plates have six characters: three letters followed by three
numbers. How many distinct such plates are possible?
This corresponds to sampling with replacement. There are $26^3 = 17{,}576$ different
ways to choose the letters and $10^3 = 1000$ ways to choose the numbers. Using
the multiplication principle again, we find there are $17{,}576 \times 1000 = 17{,}576{,}000$
different plates. ■


If all sequences of six characters are equally likely, what is the probability that the 
license plate for a new car will contain no duplicate letters or numbers? 

Call the desired event $A$; $\Omega$ consists of all 17,576,000 possible sequences. Since
these are all equally likely, the probability of $A$ is the ratio of the number of ways
that $A$ can occur to the total number of possible outcomes. There are 26 choices for
the first letter, 25 for the second, 24 for the third, and hence $26 \times 25 \times 24 = 15{,}600$
ways to choose the letters without duplication (doing so corresponds to sampling
without replacement), and $10 \times 9 \times 8 = 720$ ways to choose the numbers without
duplication. From the multiplication principle, there are $15{,}600 \times 720 = 11{,}232{,}000$
nonrepeating sequences. The probability of $A$ is thus

$$P(A) = \frac{11{,}232{,}000}{17{,}576{,}000} = .64$$
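
The counts in the license plate examples can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and `math.perm` (Python 3.8 or later) is used for the falling product $n(n-1)\cdots(n-r+1)$ of Proposition A.

```python
from math import perm

def ordered_with_replacement(n: int, r: int) -> int:
    """Number of ordered samples of size r from n items, with replacement."""
    return n ** r

def ordered_without_replacement(n: int, r: int) -> int:
    """Number of ordered samples of size r from n items, without replacement."""
    return perm(n, r)  # n * (n - 1) * ... * (n - r + 1)

# Three letters followed by three digits.
total_plates = ordered_with_replacement(26, 3) * ordered_with_replacement(10, 3)
no_repeat_plates = ordered_without_replacement(26, 3) * ordered_without_replacement(10, 3)
p_no_repeat = no_repeat_plates / total_plates

print(total_plates, no_repeat_plates, round(p_no_repeat, 2))
# prints: 17576000 11232000 0.64
```

The same helper gives the earlier line-up counts, for example `ordered_without_replacement(10, 5)` for five of ten children.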


Birthday Problem 
Suppose that a room contains n people. What is the probability that at least two of 
them have a common birthday? 

This is a famous problem with a counterintuitive answer. Assume that every day 
of the year is equally likely to be a birthday, disregard leap years, and denote by A 
the event that at least two people have a common birthday. As is sometimes the case, 
finding $P(A^c)$ is easier than finding $P(A)$. This is because $A$ can happen in many
ways, whereas $A^c$ is much simpler. There are $365^n$ possible outcomes, and $A^c$ can
happen in $365 \times 364 \times \cdots \times (365 - n + 1)$ ways. Thus,

$$P(A^c) = \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}$$

and

$$P(A) = 1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}$$

The following table exhibits the latter probabilities for various values of $n$:





 n    P(A)

 4    .016
16    .284
23    .507
32    .753
40    .891
56    .988





From the table, we see that if there are only 23 people, the probability of at least one 
match exceeds .5. The probabilities in the table are larger than one might intuitively 
guess, showing that the coincidence is not unlikely. Try it in your class.
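
The birthday probabilities can be recomputed directly from the product formula for the complement. The following is a minimal sketch under the same assumptions as the text (365 equally likely days, no leap years); the function name is illustrative.

```python
def p_shared_birthday(n: int, days: int = 365) -> float:
    """P(at least two of n people share a birthday), all days equally likely."""
    p_all_distinct = 1.0
    for i in range(n):
        # The (i + 1)th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

for n in (4, 16, 23, 32, 40, 56):
    print(n, round(p_shared_birthday(n), 3))
```

Multiplying the factors one at a time avoids forming the very large integers $365^n$ and keeps every intermediate value between 0 and 1.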


How many people must you ask to have a 50 : 50 chance of finding someone who 
shares your birthday? 

Suppose that you ask $n$ people; let $A$ denote the event that someone's birthday is
the same as yours. Again, working with $A^c$ is easier. The total number of outcomes
is $365^n$, and the total number of ways that $A^c$ can happen is $364^n$. Thus,

$$P(A^c) = \frac{364^n}{365^n}$$

and

$$P(A) = 1 - \frac{364^n}{365^n}$$

For the latter probability to be .5, $n$ should be 253, which may seem counterintuitive.
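
The value 253 can be checked by a direct search. This sketch simply increases $n$ until the probability first reaches .5; the function name is an assumption made for the example.

```python
def p_someone_shares_yours(n: int, days: int = 365) -> float:
    """P(at least one of n people has the same birthday as you)."""
    return 1.0 - ((days - 1) / days) ** n

n = 1
while p_someone_shares_yours(n) < 0.5:
    n += 1

print(n)  # prints: 253
```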


We now shift our attention from counting permutations to counting combina- 
tions. Here we are no longer interested in ordered samples, but in the constituents 
of the samples regardless of the order in which they were obtained. In particular, 
we ask the following question: If r objects are taken from a set of n objects without 
replacement and disregarding order, how many different samples are possible? From 
the multiplication principle, the number of ordered samples equals the number of 
unordered samples multiplied by the number of ways to order each sample. Since the 
number of ordered samples is $n(n - 1) \cdots (n - r + 1)$, and since a sample of size $r$
can be ordered in $r!$ ways (Corollary A), the number of unordered samples is

$$\frac{n(n - 1) \cdots (n - r + 1)}{r!} = \frac{n!}{(n - r)!\,r!}$$

This number is also denoted as $\binom{n}{r}$. We have proved the following proposition.


PROPOSITION B


The number of unordered samples of $r$ objects selected from $n$ objects without
replacement is $\binom{n}{r}$.


The numbers $\binom{n}{k}$, called the binomial coefficients, occur in the expansion

$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$$

In particular,

$$2^n = \sum_{k=0}^{n} \binom{n}{k}$$

This latter result can be interpreted as the number of subsets of a set of $n$ objects.
We just add the number of subsets of size 0 (with the usual convention that
$0! = 1$), and the number of subsets of size 1, and the number of subsets of
size 2, etc.


Up until 1991, a player of the California state lottery could win the jackpot prize by 
choosing the 6 numbers from 1 to 49 that were subsequently chosen at random by 
the lottery officials. There are $\binom{49}{6} = 13{,}983{,}816$ possible ways to choose 6 numbers
from 49, and so the probability of winning was about 1 in 14 million. If there were
no winners, the funds thus accumulated were rolled over (carried over) into the next
round of play, producing a bigger jackpot. In 1991, the rules were changed so that
a winner had to correctly select 6 numbers from 1 to 53. Since $\binom{53}{6} = 22{,}957{,}480$,
the probability of winning decreased to about 1 in 23 million. Because of the ensuing
rollover, the jackpot accumulated to a record of about $120 million. This produced a
fever of play: people were buying tickets at the rate of between 1 and 2 million per
hour and state revenues burgeoned.


In the practice of quality control, only a fraction of the output of a manufacturing 
process is sampled and examined, since it may be too time-consuming and expensive 
to examine each item, or because sometimes the testing is destructive. Suppose that 
$n$ items are in a lot and a sample of size $r$ is taken. There are $\binom{n}{r}$ such samples. Now
suppose that the lot contains $k$ defective items. What is the probability that the sample
contains exactly $m$ defectives?

Clearly, this question is relevant to the efficacy of the sampling scheme, and the
most desirable sample size can be determined by computing such probabilities for
various values of $r$. Call the event in question $A$. The probability of $A$ is the number
of ways $A$ can occur divided by the total number of outcomes. To find the number of
ways $A$ can occur, we use the multiplication principle. There are $\binom{k}{m}$ ways to choose
the $m$ defective items in the sample from the $k$ defectives in the lot, and there are
$\binom{n-k}{r-m}$ ways to choose the $r - m$ nondefective items in the sample from the $n - k$
nondefectives in the lot. Therefore, $A$ can occur in $\binom{k}{m}\binom{n-k}{r-m}$ ways. Thus, $P(A)$ is the
ratio of the number of ways $A$ can occur to the total number of outcomes, or

$$P(A) = \frac{\binom{k}{m}\binom{n-k}{r-m}}{\binom{n}{r}}$$
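
The sampling-inspection probability can be evaluated with `math.comb`. The sketch below is illustrative: the function name and the lot parameters (100 items, 5 defective, sample of 10) are hypothetical values chosen for the example, not data from the text.

```python
from math import comb

def p_exactly_m_defectives(n: int, k: int, r: int, m: int) -> float:
    """P(sample of size r from a lot of n with k defectives has exactly m defectives)."""
    # Impossible configurations have probability zero.
    if m < 0 or m > k or r - m < 0 or r - m > n - k:
        return 0.0
    return comb(k, m) * comb(n - k, r - m) / comb(n, r)

# Hypothetical lot: 100 items, 5 defective, sample of 10.
n_items, n_defective, sample_size = 100, 5, 10
for m in range(0, 6):
    print(m, round(p_exactly_m_defectives(n_items, n_defective, sample_size, m), 4))

# The lottery counts from the earlier example use the same binomial coefficients.
print(comb(49, 6), comb(53, 6))  # prints: 13983816 22957480
```

Repeating the loop for several sample sizes is one way to compare sampling schemes, as the text suggests.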


Capture/Recapture Method 
The so-called capture/recapture method is sometimes used to estimate the size of a 
wildlife population. Suppose that 10 animals are captured, tagged, and released. On 
a later occasion, 20 animals are captured, and it is found that 4 of them are tagged. 
How large is the population? 

We assume that there are $n$ animals in the population, of which 10 are tagged.
If the 20 animals captured later are taken in such a way that all $\binom{n}{20}$ possible groups
are equally likely (this is a big assumption), then the probability that 4 of them are
tagged is (using the technique of the previous example)

$$\frac{\binom{10}{4}\binom{n-10}{16}}{\binom{n}{20}}$$


Clearly, n cannot be precisely determined from the information at hand, but it can be 
estimated. One method of estimation, called maximum likelihood, is to choose that 
value of n that makes the observed outcome most probable. (The method of maximum 
likelihood is one of the main subjects of a later chapter in this text.) The probability 
of the observed outcome as a function of n is called the likelihood. Figure 1.3 shows 
the likelihood as a function of n; the likelihood is maximized at n = 50. 


FIGURE 1.3 Likelihood for Example I. (Vertical axis: likelihood.)


To find the maximum likelihood estimate, suppose that, in general, $t$ animals are
tagged. Then, of a second sample of size $m$, $r$ tagged animals are recaptured. We
estimate $n$ by the maximizer of the likelihood

$$L_n = \frac{\binom{t}{r}\binom{n-t}{m-r}}{\binom{n}{m}}$$

To find the value of $n$ that maximizes $L_n$, consider the ratio of successive terms, which
after some algebra is found to be

$$\frac{L_n}{L_{n-1}} = \frac{(n - t)(n - m)}{n(n - t - m + r)}$$

This ratio is greater than 1, i.e., $L_n$ is increasing, if

$$(n - t)(n - m) > n(n - t - m + r)$$
$$n^2 - nm - nt + mt > n^2 - nt - nm + nr$$
$$mt > nr$$

Thus, $L_n$ increases for $n < mt/r$ and decreases for $n > mt/r$; so the value of $n$ that
maximizes $L_n$ is the greatest integer not exceeding $mt/r$.

Applying this result to the data given previously, we see that the maximum
likelihood estimate of $n$ is $\frac{mt}{r} = \frac{20 \cdot 10}{4} = 50$. This estimate has some intuitive appeal,
as it equates the proportion of tagged animals in the second sample to the proportion
in the population:

$$\frac{4}{20} = \frac{10}{n}$$
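
The maximizer of the capture/recapture likelihood can also be found by brute force. The sketch below evaluates the likelihood exactly with `fractions.Fraction` for the data of the example (10 tagged, second sample of 20, 4 recaptured); the function name and the upper end of the search range are arbitrary choices made for illustration.

```python
from fractions import Fraction
from math import comb

def likelihood(n: int, t: int = 10, m: int = 20, r: int = 4) -> Fraction:
    """P(r tagged animals in a second sample of m | population n with t tagged)."""
    if n < t + (m - r):
        # Too few untagged animals to produce the observed sample.
        return Fraction(0)
    return Fraction(comb(t, r) * comb(n - t, m - r), comb(n, m))

candidates = range(26, 201)  # arbitrary search range containing mt/r = 50
best = max(likelihood(n) for n in candidates)
maximizers = [n for n in candidates if likelihood(n) == best]

print(maximizers)  # prints: [49, 50]
```

Because $mt/r = 50$ is an integer here, the ratio of successive terms equals exactly 1 at $n = 50$, so the likelihood takes the same value at $n = 49$ and $n = 50$; the rule of taking the greatest integer not exceeding $mt/r$ selects 50.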


Proposition B has the following extension. 


PROPOSITION C 


The number of ways that $n$ objects can be grouped into $r$ classes with $n_i$ in the
$i$th class, $i = 1, \ldots, r$, and $\sum_{i=1}^{r} n_i = n$ is

$$\binom{n}{n_1 n_2 \cdots n_r} = \frac{n!}{n_1!\,n_2! \cdots n_r!}$$


Proof


This can be seen by using Proposition B and the multiplication principle. (Note
that Proposition B is the special case for which $r = 2$.) There are $\binom{n}{n_1}$ ways
to choose the objects for the first class. Having done that, there are $\binom{n - n_1}{n_2}$
ways of choosing the objects for the second class. Continuing in this manner,
there are

$$\frac{n!}{n_1!\,(n - n_1)!} \cdot \frac{(n - n_1)!}{(n - n_1 - n_2)!\,n_2!} \cdots \frac{(n - n_1 - n_2 - \cdots - n_{r-1})!}{0!\,n_r!}$$


choices in all. After cancellation, this yields the desired result.


A committee of seven members is to be divided into three subcommittees of size
three, two, and two. This can be done in

$$\binom{7}{3\;2\;2} = \frac{7!}{3!\,2!\,2!} = 210$$

ways.





In how many ways can the set of nucleotides {A, A, G, G, G, G, C, C, C} be arranged
in a sequence of nine letters? Proposition C can be applied by realizing that this
problem can be cast as determining the number of ways that the nine positions in the
sequence can be divided into subgroups of sizes two, four, and three (the locations of
the letters A, G, and C):

$$\binom{9}{2\;4\;3} = \frac{9!}{2!\,4!\,3!} = 1260$$
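
Multinomial coefficients such as the ones in the committee and nucleotide examples are easy to compute. A minimal sketch follows; the helper name `multinomial` is an assumption for this example, and `math.prod` requires Python 3.8 or later.

```python
from math import comb, factorial, prod

def multinomial(*sizes: int) -> int:
    """Number of ways to split sum(sizes) objects into classes of the given sizes."""
    n = sum(sizes)
    return factorial(n) // prod(factorial(s) for s in sizes)

print(multinomial(3, 2, 2))  # committee example, prints: 210
print(multinomial(2, 4, 3))  # nucleotide example, prints: 1260

# The proof of Proposition C builds the same number from successive binomial choices.
print(comb(7, 3) * comb(4, 2) * comb(2, 2))  # prints: 210
```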


In how many ways can n = 2m people be paired and assigned to m courts for the first 
round of a tennis tournament? 





In this problem, $n_i = 2$, $i = 1, \ldots, m$, and, according to Proposition C, there
are

$$\frac{(2m)!}{2^m}$$

assignments.

One has to be careful with problems such as this one. Suppose we were asked
how many ways $2m$ people could be arranged in pairs without assigning the pairs to
courts. Since there are $m!$ ways to assign the $m$ pairs to $m$ courts, the preceding result
should be divided by $m!$, giving

$$\frac{(2m)!}{m!\,2^m}$$

pairs in all.


The numbers $\binom{n}{n_1 n_2 \cdots n_r}$ are called multinomial coefficients. They occur in the
expansion

$$(x_1 + x_2 + \cdots + x_r)^n = \sum \binom{n}{n_1 n_2 \cdots n_r} x_1^{n_1} x_2^{n_2} \cdots x_r^{n_r}$$

where the sum is over all nonnegative integers $n_1, n_2, \ldots, n_r$ such that
$n_1 + n_2 + \cdots + n_r = n$.


1.5 Conditional Probability


We introduce the definition and use of conditional probability with an example. Digi- 
talis therapy is often beneficial to patients who have suffered congestive heart failure, 
but there is the risk of digitalis intoxication, a serious side effect that is difficult to 
diagnose. To improve the chances of a correct diagnosis, the concentration of digitalis 
in the blood can be measured. Bellar et al. (1971) conducted a study of the relation 
of the concentration of digitalis in the blood to digitalis intoxication in 135 patients. 
Their results are simplified slightly in the following table, where this notation is used: 


T+ = high blood concentration (positive test)
T− = low blood concentration (negative test)
D+ = toxicity (disease present)
D− = no toxicity (disease absent)

          D+     D−    Total
T+        25     14      39
T−        18     78      96
Total     43     92     135








Thus, for example, 25 of the 135 patients had a high blood concentration of digitalis 
and suffered toxicity. 

Assume that the relative frequencies in the study roughly hold in some larger 
population of patients. (Making inferences about the frequencies in a large population 
from those observed in a small sample is a statistical problem, which will be taken 
up in a later chapter of this book.) Converting the frequencies in the preceding table 
to proportions (relative to 135), which we will regard as probabilities, we obtain the 
following table: 








          D+      D−     Total
T+       .185    .104     .289
T−       .133    .578     .711
Total    .318    .682    1.000








From the table, $P(T+) = .289$ and $P(D+) = .318$, for example. Now if a doctor
knows that the test was positive (that there was a high blood concentration), what is the
probability of disease (toxicity) given this knowledge? We can restrict our attention
to the first row of the table, and we see that of the 39 patients who had positive tests,
25 suffered from toxicity. We denote the probability that a patient shows toxicity given
that the test is positive by $P(D+ \mid T+)$, which is called the conditional probability
of $D+$ given $T+$.

$$P(D+ \mid T+) = \frac{25}{39} = .640$$

Equivalently, we can calculate this probability as

$$P(D+ \mid T+) = \frac{P(D+ \cap T+)}{P(T+)} = \frac{.185}{.289} = .640$$

In summary, we see that the unconditional probability of $D+$ is .318, whereas
the conditional probability of $D+$ given $T+$ is .640. Therefore, knowing that the test is
positive makes toxicity more than twice as likely. What if the test is negative?

$$P(D- \mid T-) = \frac{.578}{.711} = .813$$

For comparison, $P(D-) = .682$. Two other conditional probabilities from this
example are of interest: The probability of a false positive is $P(D- \mid T+) = .360$, and
the probability of a false negative is $P(D+ \mid T-) = .187$.
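
The conditional probabilities in the digitalis example can be recomputed from the table of counts. The sketch below is illustrative (the helper names are assumptions). It works from the raw counts rather than the rounded proportions, so the third decimal place can differ slightly from the figures quoted in the text.

```python
# Counts from the digitalis study table: (test result, disease status) -> patients.
counts = {
    ("T+", "D+"): 25, ("T+", "D-"): 14,
    ("T-", "D+"): 18, ("T-", "D-"): 78,
}
total = sum(counts.values())

def p(test=None, disease=None) -> float:
    """Probability of the cells matching the given test and/or disease label."""
    matching = sum(c for (t, d), c in counts.items()
                   if test in (None, t) and disease in (None, d))
    return matching / total

def p_disease_given_test(disease: str, test: str) -> float:
    """Conditional probability P(disease | test) = P(disease and test) / P(test)."""
    return p(test=test, disease=disease) / p(test=test)

print(round(p(disease="D+"), 3))                 # unconditional P(D+)
print(round(p_disease_given_test("D+", "T+"), 3))  # P(D+ | T+)
print(round(p_disease_given_test("D-", "T-"), 3))  # P(D- | T-)
print(round(p_disease_given_test("D-", "T+"), 3))  # false positive
print(round(p_disease_given_test("D+", "T-"), 3))  # false negative
```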
In general, we have the following definition. 


DEFINITION 


Let $A$ and $B$ be two events with $P(B) \neq 0$. The conditional probability of $A$
given $B$ is defined to be

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$


The idea behind this definition is that if we are given that event B occurred, 
the relevant sample space becomes $B$ rather than $\Omega$, and conditional probability is a
probability measure on $B$. In the digitalis example, to find $P(D+ \mid T+)$, we restricted
our attention to the 39 patients who had positive tests. For this new measure to be a
probability measure, it must satisfy the axioms, and this can be shown.

In some situations, $P(A \mid B)$ and $P(B)$ can be found rather easily, and we can
then find $P(A \cap B)$.


MULTIPLICATION LAW
Let $A$ and $B$ be events and assume $P(B) \neq 0$. Then

$$P(A \cap B) = P(A \mid B)P(B)$$


The multiplication law is often useful in finding the probabilities of intersections, 
as the following examples illustrate. 


EXAMPLE A An urn contains three red balls and one blue ball. Two balls are selected without
replacement. What is the probability that they are both red?

Let $R_1$ and $R_2$ denote the events that a red ball is drawn on the first trial and on
the second trial, respectively. From the multiplication law,

$$P(R_1 \cap R_2) = P(R_1)P(R_2 \mid R_1)$$

$P(R_1)$ is clearly $\frac{3}{4}$, and if a red ball has been removed on the first trial, there are two
red balls and one blue ball left. Therefore, $P(R_2 \mid R_1) = \frac{2}{3}$. Thus, $P(R_1 \cap R_2) = \frac{1}{2}$.


Suppose that if it is cloudy (B), the probability that it is raining (A) is .3, and that 
the probability that it is cloudy is $P(B) = .2$. The probability that it is cloudy and
raining is

$$P(A \cap B) = P(A \mid B)P(B) = .3 \times .2 = .06$$

Another useful tool for computing probabilities is provided by the following law.


LAW OF TOTAL PROBABILITY 


Let $B_1, B_2, \ldots, B_n$ be such that $\bigcup_{i=1}^{n} B_i = \Omega$ and $B_i \cap B_j = \emptyset$ for $i \neq j$, with
$P(B_i) > 0$ for all $i$. Then, for any event $A$,

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)P(B_i)$$

Proof


Before going through a formal proof, it is helpful to state the result in words. The
$B_i$ are mutually disjoint events whose union is $\Omega$. To find the probability of an
event $A$, we sum the conditional probabilities of $A$ given $B_i$, weighted by $P(B_i)$.
Now, for the proof, we first observe that

$$P(A) = P(A \cap \Omega) = P\left(A \cap \left(\bigcup_{i=1}^{n} B_i\right)\right) = P\left(\bigcup_{i=1}^{n} (A \cap B_i)\right)$$

Since the events $A \cap B_i$ are disjoint,

$$P\left(\bigcup_{i=1}^{n} (A \cap B_i)\right) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A \mid B_i)P(B_i)$$


The law of total probability is useful in situations where it is not obvious how to
calculate $P(A)$ directly but in which $P(A \mid B_i)$ and $P(B_i)$ are more straightforward,
such as in the following example.


Referring to Example A, what is the probability that a red ball is selected on the 
second draw? 

The answer may or may not be intuitively obvious; that depends on your
intuition. On the one hand, you could argue that it is "clear from symmetry" that
$P(R_2) = P(R_1) = \frac{3}{4}$. On the other hand, you could say that it is obvious that a red
ball is likely to be selected on the first draw, leaving fewer red balls for the second
draw, so that $P(R_2) < P(R_1)$. The answer can be derived easily by using the law of
total probability:

$$P(R_2) = P(R_2 \mid R_1)P(R_1) + P(R_2 \mid B_1)P(B_1) = \frac{2}{3} \times \frac{3}{4} + 1 \times \frac{1}{4} = \frac{3}{4}$$

where $B_1$ denotes the event that a blue ball is drawn on the first trial.
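
The urn calculations can be confirmed by listing every equally likely ordered draw of two balls. This is a small illustrative sketch using exact fractions; the helper `prob` and the event definitions are assumptions made for the example.

```python
from fractions import Fraction
from itertools import permutations

balls = ["R", "R", "R", "B"]                 # three red balls and one blue ball
draws = list(permutations(range(4), 2))      # ordered draws of two distinct balls

def prob(event) -> Fraction:
    """Probability of an event, counting equally likely ordered draws."""
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

p_r1 = prob(lambda d: balls[d[0]] == "R")
p_r2 = prob(lambda d: balls[d[1]] == "R")
p_both_red = prob(lambda d: balls[d[0]] == "R" and balls[d[1]] == "R")

# Law of total probability, conditioning on the color of the first ball.
p_r2_total = Fraction(2, 3) * Fraction(3, 4) + Fraction(1) * Fraction(1, 4)

print(p_r1, p_r2, p_both_red, p_r2_total)  # prints: 3/4 3/4 1/2 3/4
```

The enumeration agrees with the symmetry argument: the second draw is red with the same probability as the first.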





As another example of the use of conditional probability, we consider a model 
that has been used for occupational mobility. 


Suppose that occupations are grouped into upper (U), middle (M), and lower (L)
levels. $U_1$ will denote the event that a father's occupation is upper-level; $U_2$ will
denote the event that a son's occupation is upper-level, etc. (The subscripts index
generations.) Glass and Hall (1954) compiled the following statistics on occupational
mobility in England and Wales:

         U_2    M_2    L_2
U_1      .45    .48    .07
M_1      .05    .70    .25
L_1      .01    .50    .49


Such a table, which is called a matrix of transition probabilities, is to be read in
the following way: If a father is in U, the probability that his son is in U is .45, the
probability that his son is in M is .48, etc. The table thus gives conditional probabilities:
for example, $P(U_2 \mid U_1) = .45$. Examination of the table reveals that there is more
upward mobility from L into M than from M into U. Suppose that of the father's
generation, 10% are in U, 40% in M, and 50% in L. What is the probability that a
son in the next generation is in U?
Applying the law of total probability, we have

$$P(U_2) = P(U_2 \mid U_1)P(U_1) + P(U_2 \mid M_1)P(M_1) + P(U_2 \mid L_1)P(L_1)$$
$$= .45 \times .10 + .05 \times .40 + .01 \times .50 = .07$$


$P(M_2)$ and $P(L_2)$ can be worked out similarly.


Continuing with Example D, suppose we ask a different question: If a son has
occupational status $U_2$, what is the probability that his father had occupational status
$U_1$? Compared to the question asked in Example D, this is an "inverse" problem; we
are given an "effect" and are asked to find the probability of a particular "cause." In
situations like this, Bayes' rule, which we state shortly, is useful. Before stating the
rule, we will see what it amounts to in this particular case.

We wish to find $P(U_1 \mid U_2)$. By definition,

$$P(U_1 \mid U_2) = \frac{P(U_1 \cap U_2)}{P(U_2)} = \frac{P(U_2 \mid U_1)P(U_1)}{P(U_2 \mid U_1)P(U_1) + P(U_2 \mid M_1)P(M_1) + P(U_2 \mid L_1)P(L_1)}$$

Here we used the multiplication law to reexpress the numerator and the law of
total probability to restate the denominator. The value of the numerator is
$P(U_2 \mid U_1)P(U_1) = .45 \times .10 = .045$, and we calculated the denominator in
Example D to be .07, so we find that $P(U_1 \mid U_2) = .64$. In other words, 64% of the sons who
are in upper-level occupations have fathers who were in upper-level occupations.
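
Both the law of total probability and the inverse calculation for the occupational mobility data can be written in a few lines. The sketch below uses the transition table and the assumed 10%, 40%, 50% split of the fathers' generation from the example; the dictionary layout and function names are illustrative choices.

```python
# P(father's level) for the fathers' generation, as assumed in the example.
fathers = {"U": 0.10, "M": 0.40, "L": 0.50}

# transition[f][s] = P(son in level s | father in level f), from the table.
transition = {
    "U": {"U": 0.45, "M": 0.48, "L": 0.07},
    "M": {"U": 0.05, "M": 0.70, "L": 0.25},
    "L": {"U": 0.01, "M": 0.50, "L": 0.49},
}

def p_son(level: str) -> float:
    """Law of total probability: sum over the father's level."""
    return sum(transition[f][level] * fathers[f] for f in fathers)

def p_father_given_son(father: str, son: str) -> float:
    """Inverse probability P(father's level | son's level)."""
    return transition[father][son] * fathers[father] / p_son(son)

for level in ("U", "M", "L"):
    print(level, round(p_son(level), 3))
print(round(p_father_given_son("U", "U"), 2))  # prints: 0.64
```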
We now state Bayes’ rule. 


BAYES' RULE 


Let $A$ and $B_1, \ldots, B_n$ be events where the $B_i$ are disjoint, $\bigcup_{i=1}^{n} B_i = \Omega$, and
$P(B_i) > 0$ for all $i$. Then

$$P(B_j \mid A) = \frac{P(A \mid B_j)P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)P(B_i)}$$

The proof of Bayes' rule follows exactly as in the preceding discussion.


Diamond and Forrester (1979) applied Bayes’ rule to the diagnosis of coronary artery 
disease. A procedure called cardiac fluoroscopy is used to determine whether there 
is calcification of coronary arteries and thereby to diagnose coronary artery disease. 
From the test, it can be determined if 0, 1, 2, or 3 coronary arteries are calcified. Let 
$T_0, T_1, T_2, T_3$ denote these events. Let $D+$ or $D-$ denote the event that disease is
present or absent, respectively. Diamond and Forrester presented the following table,
based on medical studies:

i    P(T_i | D+)    P(T_i | D−)
0       .42            .96
1       .24            .02
2       .20            .02
3       .15            .00

According to Bayes' rule,

$$P(D+ \mid T_i) = \frac{P(T_i \mid D+)P(D+)}{P(T_i \mid D+)P(D+) + P(T_i \mid D-)P(D-)}$$

Thus, if the initial probabilities $P(D+)$ and $P(D-)$ are known, the probability that
a patient has coronary artery disease can be calculated.

Let us consider two specific cases. For the first, suppose that a male between
the ages of 30 and 39 suffers from nonanginal chest pain. For such a patient, it is
known from medical statistics that $P(D+) \approx .05$. Suppose that the test shows that
no arteries are calcified. From the preceding equation,

$$P(D+ \mid T_0) = \frac{.42 \times .05}{.42 \times .05 + .96 \times .95} = .02$$

It is unlikely that the patient has coronary artery disease. On the other hand, suppose
that the test shows that one artery is calcified. Then

$$P(D+ \mid T_1) = \frac{.24 \times .05}{.24 \times .05 + .02 \times .95} = .39$$

Now it is more likely that this patient has coronary artery disease, but by no means
certain.

As a second case, suppose that the patient is a male between ages 50 and 59 who
suffers typical angina. For such a patient, $P(D+) = .92$. For him, we find that

$$P(D+ \mid T_0) = \frac{.42 \times .92}{.42 \times .92 + .96 \times .08} = .83$$

$$P(D+ \mid T_1) = \frac{.24 \times .92}{.24 \times .92 + .02 \times .08} = .99$$

Comparing the two patients, we see the strong influence of the prior probability,
$P(D+)$.
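
The four posterior probabilities in the fluoroscopy example follow from one small function. This is a minimal sketch: the list names and the function name are assumptions, and the numbers are the likelihoods and prior probabilities quoted in the example.

```python
# Likelihoods from the Diamond and Forrester table, indexed by number of calcified arteries.
p_t_given_disease = [0.42, 0.24, 0.20, 0.15]     # P(T_i | D+)
p_t_given_no_disease = [0.96, 0.02, 0.02, 0.00]  # P(T_i | D-)

def p_disease_given_test(i: int, prior: float) -> float:
    """Bayes' rule for P(D+ | T_i) given the prior probability P(D+)."""
    numerator = p_t_given_disease[i] * prior
    denominator = numerator + p_t_given_no_disease[i] * (1.0 - prior)
    return numerator / denominator

for prior in (0.05, 0.92):
    for i in (0, 1):
        print(prior, i, round(p_disease_given_test(i, prior), 2))
```

Running the loop shows the same test result leading to very different conclusions for the two patients, because only the prior changes between them.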


Polygraph tests (lie-detector tests) are often routinely administered to employees 
or prospective employees in sensitive positions. Let + denote the event that the 
polygraph reading is positive, indicating that the subject is lying; let T denote the 
event that the subject is telling the truth; and let L denote the event that the subject is 
lying. According to studies of polygraph reliability (Gastwirth 1987), 


$$P(+ \mid L) = .88$$

from which it follows that $P(- \mid L) = .12$; also

$$P(- \mid T) = .86$$

from which it follows that $P(+ \mid T) = .14$. In words, if a person is lying, the
probability that this is detected by the polygraph is .88, whereas if he is telling the truth,
the polygraph indicates that he is telling the truth with probability .86. Now suppose
that polygraphs are routinely administered to screen employees for security reasons,
and that on a particular question the vast majority of subjects have no reason to lie so
that $P(T) = .99$, whereas $P(L) = .01$. A subject produces a positive response on
the polygraph. What is the probability that the polygraph is incorrect and that she is
in fact telling the truth? We can evaluate this probability with Bayes' rule:

$$P(T \mid +) = \frac{P(+ \mid T)P(T)}{P(+ \mid T)P(T) + P(+ \mid L)P(L)} = \frac{(.14)(.99)}{(.14)(.99) + (.88)(.01)} = .94$$

Thus, in screening this population of largely innocent people, 94% of the positive
polygraph readings will be in error. Most of those placed under suspicion because of
the polygraph result will, in fact, be innocent. This example illustrates some of the
dangers in using screening procedures on large populations.


Bayes’ rule is the fundamental mathematical ingredient of a subjective, or 
“Bayesian,” approach to epistemology, theories of evidence, and theories of learning. 
According to this point of view, an individual’s beliefs about the world can be coded 
in probabilities. For example, an individual’s belief that it will hail tomorrow can be 
represented by a probability P(H). This probability varies from individual to indi- 
vidual. In principle, each individual’s probability can be ascertained, or elicited, by 
offering him or her a series of bets at different odds. 

According to Bayesian theory, our beliefs are modified as we are confronted with 
evidence. If, initially, my probability for a hypothesis is P(H), after seeing evidence 
$E$ (e.g., a weather forecast), my probability becomes $P(H \mid E)$. $P(E \mid H)$ is often easier
to evaluate than $P(H \mid E)$. In this case, the application of Bayes' rule gives

$$P(H \mid E) = \frac{P(E \mid H)P(H)}{P(E \mid H)P(H) + P(E \mid \bar{H})P(\bar{H})}$$

where $\bar{H}$ is the event that $H$ does not hold. This point can be illustrated by the
preceding polygraph example. Suppose an investigator is questioning a particular
suspect and that the investigator's prior opinion that the suspect is telling the truth
is $P(T)$. Then, upon observing a positive polygraph reading, his opinion becomes
$P(T \mid +)$. Note that different investigators will have different prior probabilities $P(T)$
for different suspects, and thus different posterior probabilities.

As appealing as this formulation might be, a long line of research has
demonstrated that humans are actually not very good at doing probability calculations in
evaluating evidence. For example, Tversky and Kahneman (1974) presented subjects
with the following question: "If Linda is a 31-year-old single woman who is
outspoken on social issues such as disarmament and equal rights, which of the following
statements is more likely to be true?

* Linda is a bank teller.
* Linda is a bank teller and active in the feminist movement."

More than 80% of those questioned chose the second statement, despite Property C
of Section 1.3.

Even highly trained professionals are not good at doing probability calculations,
as illustrated by the following example of Eddy (1982), regarding interpreting the
results from mammogram screening. One hundred physicians were presented with
the following information:


In the absence of any special information, the probability that a woman (of the age 
and health status of this patient) has breast cancer is 1%. 

If the patient has breast cancer, the probability that the radiologist will correctly 
diagnose it is 80%. 

If the patient has a benign lesion (no breast cancer), the probability that the radiol- 
ogist will incorrectly diagnose it as cancer is 10%. 


They were then asked, “What is the probability that a patient with a positive mam- 
mogram actually has breast cancer?" 

Ninety-five of the 100 physicians estimated the probability to be about 75%. The 
correct probability, as given by Bayes' rule, is 7.5%. (You should check this.) So even
experts radically overestimate the strength of the evidence provided by a positive 
outcome on the screening test. 
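
The check suggested above takes only a few lines with exact fractions. The sketch below applies the two-hypothesis form of Bayes' rule to the mammogram figures and, for comparison, to the polygraph figures from the earlier example; the function and argument names are illustrative.

```python
from fractions import Fraction

def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Two-hypothesis Bayes' rule: P(H | E) from P(H), P(E | H), and P(E | not H)."""
    numerator = p_evidence_given_h * prior
    return numerator / (numerator + p_evidence_given_not_h * (1 - prior))

# Mammogram: H = breast cancer, E = positive mammogram.
p_cancer = posterior(Fraction(1, 100), Fraction(80, 100), Fraction(10, 100))
print(p_cancer, round(float(p_cancer), 3))  # prints: 8/107 0.075

# Polygraph: H = telling the truth, E = positive reading.
p_truthful = posterior(Fraction(99, 100), Fraction(14, 100), Fraction(88, 100))
print(p_truthful, round(float(p_truthful), 2))  # prints: 63/67 0.94
```

In both cases the hypothesis with the large prior probability dominates the posterior, which is why a positive result is weaker evidence than intuition suggests.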

Thus the Bayesian probability calculus does not describe the way people actually 
assimilate evidence. Advocates for Bayesian learning theory might assert that the 
theory describes the way people "should think." A softer point of view is that Bayesian 
learning theory is a model for learning, and it has the merit of being a simple model 
that can be programmed on computers. Probability theory in general, and Bayesian 
learning theory in particular, are part of the core of artificial intelligence. 


Independence 


Intuitively, we would say that two events, A and B, are independent if knowing that 
one had occurred gave us no information about whether the other had occurred; that 
is, $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$. Now, if

$$P(A) = P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

then

$$P(A \cap B) = P(A)P(B)$$

We will use this last relation as the definition of independence. Note that it is symmetric
in $A$ and in $B$, and does not require the existence of a conditional probability, that is,
$P(B)$ can be 0.


DEFINITION 


$A$ and $B$ are said to be independent events if $P(A \cap B) = P(A)P(B)$.


A card is selected randomly from a deck. Let $A$ denote the event that it is an ace
and $D$ the event that it is a diamond. Knowing that the card is an ace gives no
information about its suit. Checking formally that the events are independent, we
have $P(A) = \frac{4}{52} = \frac{1}{13}$ and $P(D) = \frac{1}{4}$. Also, $A \cap D$ is the event that the card is the ace
of diamonds and $P(A \cap D) = \frac{1}{52}$. Since $P(A)P(D) = \left(\frac{1}{4}\right) \times \left(\frac{1}{13}\right) = \frac{1}{52}$, the events
are in fact independent.


A system is designed so that it fails only if a unit and a backup unit both fail. Assuming 
that these failures are independent and that each unit fails with probability p, the 
system fails with probability $p^2$. If, for example, the probability that any unit fails
during a given year is .1, then the probability that the system fails is .01, which
represents a considerable improvement in reliability.


Things become more complicated when we consider more than two events. For 
example, suppose we know that events A, B, and C are pairwise independent (any 
two are independent). We would like to be able to say that they are all independent 
based on the assumption that knowing something about two of the events does not tell 
us anything about the third, for example, $P(C \mid A \cap B) = P(C)$. But as the following
example shows, pairwise independence does not guarantee mutual independence. 


A fair coin is tossed twice. Let A denote the event of heads on the first toss, B the 
event of heads on the second toss, and C the event that exactly one head is thrown. A 
and B are clearly independent, and P(A) = P(B) = P(C) = .5. To see that A and 
C are independent, we observe that P(C | A) = .5. But 





$$P(A \cap B \cap C) = 0 \neq P(A)P(B)P(C)$$
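
The two-toss example is small enough to verify by listing the four equally likely outcomes. In this illustrative sketch the helper `prob` and the event functions are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))  # (first toss, second toss)

def prob(event) -> Fraction:
    """Probability of an event over the four equally likely outcomes."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

def a(w): return w[0] == "H"           # heads on the first toss
def b(w): return w[1] == "H"           # heads on the second toss
def c(w): return w.count("H") == 1     # exactly one head

# Every pair of events satisfies the product rule...
for x, y in ((a, b), (a, c), (b, c)):
    assert prob(lambda w: x(w) and y(w)) == prob(x) * prob(y)

# ...but the triple does not.
p_all = prob(lambda w: a(w) and b(w) and c(w))
print(p_all, prob(a) * prob(b) * prob(c))  # prints: 0 1/8
```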


To encompass situations such as that in Example C, we define a collection
of events, $A_1, A_2, \ldots, A_n$, to be mutually independent if for any subcollection,
$A_{i_1}, \ldots, A_{i_m}$,

$$P(A_{i_1} \cap \cdots \cap A_{i_m}) = P(A_{i_1}) \cdots P(A_{i_m})$$


We return to Example B of Section 1.3 (infectivity of AIDS). Suppose that virus 
transmissions in 500 acts of intercourse are mutually independent events and that 
the probability of transmission in any one act is 1/500. Under this model, what is the 
probability of infection? It is easier to first find the probability of the complement 
of this event. Let $C_1, C_2, \ldots, C_{500}$ denote the events that virus transmission does not
occur during encounters $1, 2, \ldots, 500$. Then the probability of no infection is

$$P(C_1 \cap C_2 \cap \cdots \cap C_{500}) = \left(1 - \frac{1}{500}\right)^{500} = .37$$

so the probability of infection is $1 - .37 = .63$, not 1, which is the answer produced
by incorrectly adding probabilities.


Consider a circuit with three relays (Figure 1.4). Let $A_i$ denote the event that the $i$th
relay works, and assume that $P(A_i) = p$ and that the relays are mutually independent.
If $F$ denotes the event that current flows through the circuit, then $F = A_3 \cup (A_1 \cap A_2)$
and, from the addition law and the assumption of independence,

$$P(F) = P(A_3) + P(A_1 \cap A_2) - P(A_1 \cap A_2 \cap A_3) = p + p^2 - p^3$$





FIGURE 1.4 Circuit with three relays. 


Suppose that a system consists of components connected in a series, so the system 
fails if any one component fails. If there are n mutually independent components and 
each fails with probability p, what is the probability that the system will fail? 

It is easier to find the probability of the complement of this event; the system 
works if and only if all the components work, and this situation has probability
$(1 - p)^n$. The probability that the system fails is then $1 - (1 - p)^n$. For example, if
$n = 10$ and $p = .05$, the probability that the system works is only $.95^{10} = .60$, and
the probability that the system fails is .40.

Suppose, instead, that the components are connected in parallel, so the system
fails only when all components fail. In this case, the probability that the system fails
is only $.05^{10} = 9.8 \times 10^{-14}$.
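
The series and parallel failure probabilities are one-line formulas once independence is assumed. The following sketch uses hypothetical function names and reproduces the numbers of the example; it says nothing about systems whose components fail dependently.

```python
def p_series_fails(n: int, p: float) -> float:
    """Series system of n independent components: fails if any component fails."""
    return 1.0 - (1.0 - p) ** n

def p_parallel_fails(n: int, p: float) -> float:
    """Parallel system of n independent components: fails only if all fail."""
    return p ** n

n_components, p_fail = 10, 0.05
print(round(1.0 - p_series_fails(n_components, p_fail), 2))  # works, prints: 0.6
print(round(p_series_fails(n_components, p_fail), 2))        # fails, prints: 0.4
print(f"{p_parallel_fails(n_components, p_fail):.1e}")       # prints: 9.8e-14
```

The unit-plus-backup system discussed earlier is the parallel case with two components, `p_parallel_fails(2, 0.1)`.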


Calculations like those in Example F are made in reliability studies for sys- 
tems consisting of quite complicated networks of components. The absolutely crucial 
assumption is that the components are independent of one another. Theoretical studies 
of the reliability of nuclear power plants have been criticized on the grounds that they 
incorrectly assume independence of the components. 


Matching DNA Fragments 
Fragments of DNA are often compared for similarity, for example, across species. 
A simple way to make a comparison is to count the number of locations, or sites, 
at which these fragments agree. For example, consider these two sequences, which 
agree at three sites: fragment 1: AGATCAGT; and fragment 2: TGGATACT. 

Many such comparisons are made, and to sort the wheat from the chaff, a probability
model is often used. A comparison is deemed interesting if the number of
matches is much larger than would be expected by chance alone. This requires a
chance model; a simple one stipulates that the nucleotide at each site of fragment 1
occurs randomly with probabilities $p_{A1}, p_{G1}, p_{C1}, p_{T1}$, and that the second fragment
is similarly composed with probabilities $p_{A2}, \ldots, p_{T2}$. What is the chance that the
fragments match at a particular site if in fact the identity of the nucleotide on fragment 1
is independent of that on fragment 2? The match probability can be calculated
using the law of total probability:


$$P(\text{match}) = P(\text{match} \mid \text{A on fragment 1}) P(\text{A on fragment 1}) + \cdots + P(\text{match} \mid \text{T on fragment 1}) P(\text{T on fragment 1})$$

$$= p_{A2} p_{A1} + p_{G2} p_{G1} + p_{C2} p_{C1} + p_{T2} p_{T1}$$


The problem of determining the probability that they match at k out of a total of
n sites is discussed later. ■
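
The match count and the match probability can both be sketched in a few lines. The nucleotide probabilities below are hypothetical values chosen only for illustration, and the function names are not from the text; the calculation assumes, as in the example, that the two fragments are independent.

```python
def count_matches(frag1, frag2):
    """Number of sites at which two equal-length fragments agree."""
    if len(frag1) != len(frag2):
        raise ValueError("fragments must have the same length")
    return sum(a == b for a, b in zip(frag1, frag2))


def match_probability(probs1, probs2):
    """Law of total probability: sum over nucleotides x of p_{x2} * p_{x1}."""
    return sum(probs1[x] * probs2[x] for x in "AGCT")


# The two fragments from the text.
print(count_matches("AGATCAGT", "TGGATACT"))

# Hypothetical nucleotide probabilities for each fragment.
probs1 = {"A": 0.25, "G": 0.25, "C": 0.25, "T": 0.25}
probs2 = {"A": 0.10, "G": 0.40, "C": 0.40, "T": 0.10}
print(match_probability(probs1, probs2))
```

When fragment 1 is uniform over the four nucleotides, the match probability is .25 whatever the composition of fragment 2, since the second set of probabilities sums to 1.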


Concluding Remarks 


This chapter provides a simple axiomatic development of the mathematical theory of 
probability. Some subtle issues that arise in a careful analysis of infinite sample spaces 
have been neglected. Such issues are typically addressed in graduate-level courses 
in measure theory and probability theory. Certain philosophical questions have also 
been avoided. One might ask what is meant by the statement “The probability that
this coin will land heads up is $\frac{1}{2}$.” Two commonly advocated views are the frequentist
approach and the Bayesian approach. According to the frequentist approach,
the statement means that if the experiment were repeated many times, the long-run
average number of heads would tend to $\frac{1}{2}$. According to the Bayesian approach, the
statement is a quantification of the speaker's uncertainty about the outcome of the 
experiment and thus is a personal or subjective notion; the probability that the coin 
will land heads up may be different for different speakers, depending on their ex- 
perience and knowledge of the situation. There has been vigorous and occasionally 
acrimonious debate among proponents of various versions of these points of view. 

In this and ensuing chapters, there are many examples of the use of probability 
as a model for various phenomena. In any such modeling endeavor, an idealized 
mathematical theory is hoped to provide an adequate match to characteristics of the 
phenomenon under study. The standard of adequacy is relative to the field of study 
and the modeler's goals. 


Problems 


1. A coin is tossed three times and the sequence of heads and tails is recorded. 


a. List the sample space. 

b. List the elements that make up the following events: (1) A = at least two
heads, (2) B = the first two tosses are heads, (3) C = the last toss is a tail.

c. List the elements of the following events: (1) $A^c$, (2) $A \cap B$, (3) $A \cup C$.


2. Two six-sided dice are thrown sequentially, and the face values that come up are
recorded.


a. List the sample space. 

b. List the elements that make up the following events: (1) A = the sum of the
two values is at least 5, (2) B = the value of the first die is higher than the
value of the second, (3) C = the first value is 4.

c. List the elements of the following events: (1) $A \cap C$, (2) $B \cup C$, (3) $A \cap (B \cup C)$.


3. An urn contains three red balls, two green balls, and one white ball. Three balls
are drawn without replacement from the urn, and the colors are noted in sequence.
List the sample space. Define events A, B, and C as you wish and find their unions
and intersections.


4. Draw Venn diagrams to illustrate De Morgan's laws:

$$(A \cup B)^c = A^c \cap B^c$$

$$(A \cap B)^c = A^c \cup B^c$$


5. Let A and B be arbitrary events. Let C be the event that either A occurs or B
occurs, but not both. Express C in terms of A and B using any of the basic
operations of union, intersection, and complement.


6. Verify the following extension of the addition rule (a) by an appropriate Venn
diagram and (b) by a formal argument using the axioms of probability and the
propositions in this chapter.

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$


7. Prove Bonferroni's inequality:

$$P(A \cap B) \ge P(A) + P(B) - 1$$


8. Prove that

$$P\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} P(A_i)$$


9. The weather forecaster says that the probability of rain on Saturday is 25% and
that the probability of rain on Sunday is 25%. Is the probability of rain during
the weekend 50%? Why or why not?


10. Make up another example of Simpson’s paradox by changing the numbers in
Example B of Section 1.4. 


11. The first three digits of a university telephone exchange are 452. If all the
sequences of the remaining four digits are equally likely, what is the probability
that a randomly selected university phone number contains seven distinct digits? 


12. In a game of poker, five players are each dealt 5 cards from a 52-card deck. How
many ways are there to deal the cards?


13. In a game of poker, what is the probability that a five-card hand will contain (a)
a straight (five cards in unbroken numerical sequence), (b) four of a kind, and (c) 
a full house (three cards of one value and two cards of another value)? 


14. The four players in a bridge game are each dealt 13 cards. How many ways are
there to do this? 


15. How many different meals can be made from four kinds of meat, six vegetables,
and three starches if a meal consists of one selection from each group? 


16. How many different letter arrangements can be obtained from the letters of the
word statistically, using all the letters? 


17. In acceptance sampling, a purchaser samples 4 items from a lot of 100 and rejects
the lot if 1 or more are defective. Graph the probability that the lot is accepted as 
a function of the percentage of defective items in the lot. 


18. A lot of n items contains k defectives, and m are selected randomly and inspected.
How should the value of m be chosen so that the probability that at least one
defective item turns up is .90? Apply your answer to (a) n = 1000, k = 10, and
(b) n = 10,000, k = 100.


19. A committee consists of five Chicanos, two Asians, three African Americans,
and two Caucasians. 


a. A subcommittee of four is chosen at random. What is the probability that all 
the ethnic groups are represented on the subcommittee? 
b. Answer the question for part (a) if a subcommittee of five is chosen. 


20. A deck of 52 cards is shuffled thoroughly. What is the probability that the four
aces are all next to each other? 


21. A fair coin is tossed five times. What is the probability of getting a sequence of
three heads? 


22. A standard deck of 52 cards is shuffled thoroughly, and n cards are turned up.
What is the probability that a face card turns up? For what value of n is this 
probability about .5? 


23. How many ways are there to place n indistinguishable balls in n urns so that
exactly one urn is empty? 


24. If n balls are distributed randomly into k urns, what is the probability that the
last urn contains j balls? 


25. A woman getting dressed up for a night out is asked by her significant other to
wear a red dress, high-heeled sneakers, and a wig. In how many orders can she 
put on these objects? 


26. The game of Mastermind starts in the following way: One player selects four
pegs, each peg having six possible colors, and places them in a line. The second
player then tries to guess the sequence of colors. What is the probability of
guessing correctly?


27. If a five-letter word is formed at random (meaning that all sequences of five letters
are equally likely), what is the probability that no letter occurs more than once? 


28. How many ways are there to encode the 26-letter English alphabet into 8-bit
binary words (sequences of eight 0s and 1s)?


29. A poker player is dealt three spades and two hearts. He discards the two hearts and
draws two more cards. What is the probability that he draws two more spades? 


30. A group of 60 second graders is to be randomly assigned to two classes of 30 each.
(The random assignment is ordered by the school district to ensure against any 
bias.) Five of the second graders, Marcelle, Sarah, Michelle, Katy, and Camerin, 
are close friends. What is the probability that they will all be in the same class? 
What is the probability that exactly four of them will be? What is the probability 
that Marcelle will be in one class and her friends in the other? 


31. Six male and six female dancers perform the Virginia reel. This dance requires
that they form a line consisting of six male/female pairs. How many such ar- 
rangements are there? 


32. A wine taster claims that she can distinguish four vintages of a particular Cabernet.
What is the probability that she can do this by merely guessing? (She is
confronted with four unlabeled glasses.) 


33. An elevator containing five people can stop at any of seven floors. What is the
probability that no two people get off at the same floor? Assume that the occupants 
act independently and that all floors are equally likely for each occupant. 


34. Prove the following identity:


SAED- 


(Hint: How can each of the summands be interpreted?) 


35. Prove the following two identities both algebraically and by interpreting their
meaning combinatorially.


a. $\binom{n}{r} = \binom{n}{n-r}$
b. $\binom{n}{r} = \binom{n-1}{r-1} + \binom{n-1}{r}$


36. What is the coefficient of $x^3 y^4$ in the expansion of $(x + y)^7$?


37. What is the coefficient of $x^2 y^2 z^3$ in the expansion of $(x + y + z)^7$?


38. A child has six blocks, three of which are red and three of which are green. How
many patterns can she make by placing them all in a line? If she is given three white
blocks, how many total patterns can she make by placing all nine blocks in a line?


39. A monkey at a typewriter types each of the 26 letters of the alphabet exactly once,
the order being random.


a. What is the probability that the word Hamlet appears somewhere in the string
of letters?
b. How many independent monkey typists would you need in order that the
probability that the word appears is at least .90?


40. In how many ways can two octopi shake hands? (There are a number of ways to
interpret this question—choose one.) 


41. A drawer of socks contains seven black socks, eight blue socks, and nine green
socks. Two socks are chosen in the dark. 


a. What is the probability that they match? 
b. What is the probability that a black pair is chosen? 


42. How many ways can 11 boys on a soccer team be grouped into 4 forwards,
3 midfielders, 3 defenders, and 1 goalie? 


43. A software development company has three jobs to do. Two of the jobs require
three programmers, and the other requires four. If the company employs ten 
programmers, how many different ways are there to assign them to the jobs? 


44. In how many ways can 12 people be divided into three groups of 4 for an evening
of bridge? In how many ways can this be done if the 12 consist of six pairs of 
partners? 


45. Show that if the conditional probabilities exist, then

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1})$$


46. Urn A has three red balls and two white balls, and urn B has two red balls and
five white balls. A fair coin is tossed. If it lands heads up, a ball is drawn from 
urn A; otherwise, a ball is drawn from urn B. 


a. What is the probability that a red ball is drawn? 
b. If a red ball is drawn, what is the probability that the coin landed heads up? 


47. Urn A has four red, three blue, and two green balls. Urn B has two red, three
blue, and four green balls. A ball is drawn from urn A and put into urn B, and 
then a ball is drawn from urn B. 


a. What is the probability that a red ball is drawn from urn B? 


b. If a red ball is drawn from urn B, what is the probability that a red ball was 
drawn from urn A? 


48. An urn contains three red and two white balls. A ball is drawn, and then it and
another ball of the same color are placed back in the urn. Finally, a second ball
is drawn.

a. What is the probability that the second ball drawn is white? 

b. If the second ball drawn is white, what is the probability that the first ball 
drawn was red? 


49. A fair coin is tossed three times.


a. What is the probability of two or more heads given that there was at least one 
head? 
b. What is the probability given that there was at least one tail?


50. Two dice are rolled, and the sum of the face values is six. What is the probability
that at least one of the dice came up a three? 


51. Answer Problem 50 again given that the sum is less than six.


52. Suppose that 5 cards are dealt from a 52-card deck and the first one is a king.
What is the probability of at least one more king? 


53. A fire insurance company has high-risk, medium-risk, and low-risk clients, who
have, respectively, probabilities .02, .01, and .0025 of filing claims within a given 
year. The proportions of the numbers of clients in the three categories are .10, 
.20, and .70, respectively. What proportion of the claims filed each year come 
from high-risk clients? 


54. This problem introduces a simple meteorological model, more complicated
versions of which have been proposed in the meteorological literature. Consider
a sequence of days and let $R_i$ denote the event that it rains on day $i$. Suppose
that $P(R_i \mid R_{i-1}) = \alpha$ and $P(R_i^c \mid R_{i-1}^c) = \beta$. Suppose further that only today’s
weather is relevant to predicting tomorrow's; that is,
$P(R_i \mid R_{i-1} \cap R_{i-2} \cap \cdots \cap R_0) = P(R_i \mid R_{i-1})$.

a. If the probability of rain today is $p$, what is the probability of rain tomorrow?
b. What is the probability of rain the day after tomorrow?
c. What is the probability of rain $n$ days from now? What happens as $n$ approaches
infinity?


55. This problem continues Example D of Section 1.5 and concerns occupational
mobility.

a. Find $P(M_1 \mid M_2)$ and $P(L_1 \mid L_2)$.

b. Find the proportions that will be in the three occupational levels in the third 
generation. To do this, assume that a son's occupational status depends on 
his father's status, but that given his father's status, it does not depend on his 
grandfather's. 


56. A couple has two children. What is the probability that both are girls given that
the oldest is a girl? What is the probability that both are girls given that one of 
them is a girl? 


57. There are three cabinets, A, B, and C, each of which has two drawers. Each
drawer contains one coin; A has two gold coins, B has two silver coins, and C 
has one gold and one silver coin. A cabinet is chosen at random, one drawer is 
opened, and a silver coin is found. What is the probability that the other drawer 
in that cabinet contains a silver coin? 


58. A teacher tells three boys, Drew, Chris, and Jason, that two of them will have
to stay after school to help her clean erasers and that one of them will be able to
leave. She further says that she has made the decision as to who will leave and who
will stay at random by rolling a special three-sided Dungeons and Dragons die.
Drew wants to leave to play soccer and has a clever idea about how to increase his
chances of doing so. He figures that one of Jason and Chris will certainly stay and
asks the teacher to tell him the name of one of the two who will stay. Drew's idea
is that if, for example, Jason is named, then he and Chris are left and they each
have a probability .5 of leaving; similarly, if Chris is named, Drew's probability
of leaving is still .5. Thus, by merely asking the teacher a question, Drew will
increase his probability of leaving from $\frac{1}{3}$ to $\frac{1}{2}$. What do you think of this scheme?


59. A box has three coins. One has two heads, one has two tails, and the other is a
fair coin with one head and one tail. A coin is chosen at random, is flipped, and
comes up heads.

a. What is the probability that the coin chosen is the two-headed coin? 

b. What is the probability that if it is thrown another time it will come up heads? 

c. Answer part (a) again, supposing that the coin is thrown a second time and 
comes up heads again. 


60. A factory runs three shifts. In a given day, 1% of the items produced by the first
shift are defective, 2% of the second shift's items are defective, and 5% of the
third shift's items are defective. If the shifts all have the same productivity, what 
percentage of the items produced in a day are defective? If an item is defective, 
what is the probability that it was produced by the third shift? 


61. Suppose that chips for an integrated circuit are tested and that the probability
that they are detected if they are defective is .95, and the probability that they are 
declared sound if in fact they are sound is .97. If .5% of the chips are faulty, what 
is the probability that a chip that is declared faulty is sound? 


62. Show that if $P(A \mid E) > P(B \mid E)$ and $P(A \mid E^c) > P(B \mid E^c)$, then
$P(A) > P(B)$.


63. Suppose that the probability of living to be older than 70 is .6 and the probability
of living to be older than 80 is .2. If a person reaches her 70th birthday, what is 
the probability that she will celebrate her 80th? 


64. If $B$ is an event, with $P(B) > 0$, show that the set function $Q(A) = P(A \mid B)$
satisfies the axioms for a probability measure. Thus, for example,

$$P(A \cup C \mid B) = P(A \mid B) + P(C \mid B) - P(A \cap C \mid B)$$


65. Show that if $A$ and $B$ are independent, then $A$ and $B^c$ as well as $A^c$ and $B^c$ are
independent.


66. Show that $\emptyset$ is independent of $A$ for any $A$.


67. Show that if $A$ and $B$ are independent, then

$$P(A \cup B) = P(A) + P(B) - P(A)P(B)$$


68. If A is independent of B and B is independent of C, then A is independent of C.
Prove this statement or give a counterexample if it is false. 


69. If $A$ and $B$ are disjoint, can they be independent?


70. If $A \subset B$, can $A$ and $B$ be independent?


71. Show that if $A$, $B$, and $C$ are mutually independent, then $A \cap B$ and $C$ are
independent and $A \cup B$ and $C$ are independent.


72. Suppose that n components are connected in series. For each unit, there is a
backup unit, and the system fails if and only if both a unit and its backup fail. 
Assuming that all the units are independent and fail with probability p, what is 
the probability that the system works? For n = 10 and p = .05, compare these 
results with those of Example F in Section 1.6. 


73. A system has n independent units, each of which fails with probability p. The
system fails only if k or more of the units fail. What is the probability that 
the system fails? 


74. What is the probability that the following system works if each unit fails
independently with probability p (see Figure 1.5)?


FIGURE 1.5


75. This problem deals with an elementary aspect of a simple branching process. A
population starts with one member; at time $t = 1$, it either divides with probability
$p$ or dies with probability $1 - p$. If it divides, then both of its children
behave independently with the same two alternatives at time t = 2. What is the 
probability that there are no members in the third generation? For what value of 
p is this probability equal to .5? 


76. Here is a simple model of a queue. The queue runs in discrete time
($t = 0, 1, 2, \ldots$), and at each unit of time the first person in the queue is served with
probability $p$ and, independently, a new person arrives with probability $q$. At
time $t = 0$, there is one person in the queue. Find the probabilities that there are
0, 1, 2, 3 people in line at time t = 2. 


77. A player throws darts at a target. On each trial, independently of the other trials,
he hits the bull's-eye with probability .05. How many times should he throw so 
that his probability of hitting the bull’s-eye at least once is .5? 


78. This problem introduces some aspects of a simple genetic model. Assume that
genes in an organism occur in pairs and that each member of the pair can be either 
of the types a or A. The possible genotypes of an organism are then AA, Aa, and 
aa (Aa and aA are equivalent). When two organisms mate, each independently 
contributes one of its two genes; either one of the pair is transmitted with prob- 
ability .5. 

a. Suppose that the genotypes of the parents are AA and Aa. Find the possible
genotypes of their offspring and the corresponding probabilities.

b. Suppose that the probabilities of the genotypes AA, Aa, and aa are p, 2q,
and r, respectively, in the first generation. Find the probabilities in the second
and third generations, and show that these are the same. This result is called
the Hardy-Weinberg Law.

c. Compute the probabilities for the second and third generations as in part (b)
but under the additional assumption that the probabilities that an individual of
type AA, Aa, or aa survives to mate are u, v, and w, respectively.


79. Many human diseases are genetically transmitted (for example, hemophilia or 
Tay-Sachs disease). Here is a simple model for such a disease. The genotype 
aa is diseased and dies before it mates. The genotype Aa is a carrier but is not 
diseased. The genotype AA is not a carrier and is not diseased. 


a. If two carriers mate, what are the probabilities that their offspring are of each
of the three genotypes?

b. If the male offspring of two carriers is not diseased, what is the probability
that he is a carrier?

c. Suppose that the nondiseased offspring of part (b) mates with a member of the
population for whom no family history is available and who is thus assumed
to have probability p of being a carrier (p is a very small number). What are
the probabilities that their first offspring has the genotypes AA, Aa, and aa?

d. Suppose that the first offspring of part (c) is not diseased. What is the probability
that the father is a carrier in light of this evidence?


80. If a parent has genotype Aa, he transmits either A or a to an offspring (each with
a $\frac{1}{2}$ chance). The gene he transmits to one offspring is independent of the one
he transmits to another. Consider a parent with three children and the following
events: A = {children 1 and 2 have the same gene}, B = {children 1 and 3 have
the same gene}, C = {children 2 and 3 have the same gene}. Show that these
events are pairwise independent but not mutually independent.


Discrete Random Variables


A random variable is essentially a random number. As motivation for a definition, let 
us consider an example. A coin is thrown three times, and the sequence of heads and 
tails is observed; thus, 


$$\Omega = \{hhh, hht, htt, hth, ttt, tth, thh, tht\}$$


Examples of random variables defined on $\Omega$ are (1) the total number of heads, (2) the
total number of tails, and (3) the number of heads minus the number of tails. Each of
these is a real-valued function defined on $\Omega$; that is, each is a rule that assigns a real
number to every point $\omega \in \Omega$. Since the outcome in $\Omega$ is random, the corresponding
number is random as well.

In general, a random variable is a function from $\Omega$ to the real numbers. Because
the outcome of the experiment with sample space $\Omega$ is random, the number produced
by the function is random as well. It is conventional to denote random variables by 
italic uppercase letters from the end of the alphabet. For example, we might define 
X to be the total number of heads in the experiment described above. A discrete 
random variable is a random variable that can take on only a finite or at most a 
countably infinite number of values. The random variable X just defined is a discrete 
random variable since it can take on only the values 0, 1, 2, and 3. For an example of 
a random variable that can take on a countably infinite number of values, consider an 
experiment that consists of tossing a coin until a head turns up and defining Y to be
the total number of tosses. The possible values of Y are 1, 2, 3, .... In general, a
countably infinite set is one that can be put into one-to-one correspondence with the 
integers. 

If the coin is fair, then each of the outcomes in $\Omega$ above has probability $\frac{1}{8}$,
from which the probabilities that $X$ takes on the values 0, 1, 2, and 3 can be easily
computed:

$$P(X = 0) = \tfrac{1}{8}$$
$$P(X = 1) = \tfrac{3}{8}$$
$$P(X = 2) = \tfrac{3}{8}$$
$$P(X = 3) = \tfrac{1}{8}$$


Generally, the probability measure on the sample space determines the probabilities
of the various values of $X$; if those values are denoted by $x_1, x_2, \ldots$, then there is a
function $p$ such that $p(x_i) = P(X = x_i)$ and $\sum_i p(x_i) = 1$. This function is called
the probability mass function, or the frequency function, of the random variable 
X. Figure 2.1 shows a graph of p(x) for the coin tossing experiment. The frequency 
function describes completely the probability properties of the random variable. 
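
The frequency function of the number of heads can be obtained by direct enumeration of the sample space. The following is a minimal sketch, assuming a fair coin so that all sequences are equally likely; the function name `heads_pmf` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_pmf(n_tosses=3):
    """Frequency function of X = number of heads in n_tosses fair tosses."""
    outcomes = list(product("ht", repeat=n_tosses))  # the sample space
    counts = Counter(seq.count("h") for seq in outcomes)
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}


pmf = heads_pmf(3)
for x, prob in pmf.items():
    print(x, prob)
print("total:", sum(pmf.values()))
```

Exact fractions are used so that the values can be compared directly with the eighths computed by hand.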


FIGURE 2.1 A probability mass function.


In addition to the frequency function, it is sometimes convenient to use the
cumulative distribution function (cdf) of a random variable, which is defined to be


$$F(x) = P(X \le x), \qquad -\infty < x < \infty$$


Cumulative distribution functions are usually denoted by uppercase letters and frequency
functions by lowercase letters. Figure 2.2 is a graph of the cumulative distribution
function of the random variable $X$ of the preceding paragraph. Note that the cdf
jumps wherever $p(x) > 0$ and that the jump at $x_i$ is $p(x_i)$. For example, if $0 < x < 1$,
$F(x) = \frac{1}{8}$; at $x = 1$, $F(x)$ jumps to $F(1) = \frac{4}{8} = \frac{1}{2}$. The jump at $x = 1$ is $p(1) = \frac{3}{8}$.


The cumulative distribution function is non-decreasing and satisfies

$$\lim_{x \to -\infty} F(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} F(x) = 1.$$
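
The step behavior of the cdf can be seen by evaluating it at a few points. This sketch uses the frequency function of the number of heads in three tosses of a fair coin; the names `PMF` and `cdf` are illustrative.

```python
from fractions import Fraction

# Frequency function of the number of heads in three fair coin tosses.
PMF = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}


def cdf(x, pmf=PMF):
    """F(x) = P(X <= x): add p(x_i) over all values x_i not exceeding x."""
    return sum((prob for value, prob in pmf.items() if value <= x), Fraction(0))


for x in (-1, 0, 0.5, 1, 2.9, 3, 10):
    print(x, cdf(x))
```

The function is constant between the possible values of X and rises by p(x_i) at each x_i, from 0 on the far left to 1 on the far right.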


FIGURE 2.2 The cumulative distribution function corresponding to Figure 2.1.


Chapter 3 will cover in detail the joint frequency functions of several random
variables defined on the same sample space, but it is useful to define here the concept
of independence of random variables. In the case of two discrete random variables $X$
and $Y$, taking on possible values $x_1, x_2, \ldots$, and $y_1, y_2, \ldots$, $X$ and $Y$ are said to be
independent if, for all $i$ and $j$,


$$P(X = x_i \text{ and } Y = y_j) = P(X = x_i) P(Y = y_j)$$


The definition is extended to collections of more than two discrete random variables 
in the obvious way; for example, X, Y, and Z are said to be mutually independent if, 
for all i, j, and k, 


$$P(X = x_i, Y = y_j, Z = z_k) = P(X = x_i) P(Y = y_j) P(Z = z_k)$$





We next discuss some common discrete distributions that arise in applications. 


Bernoulli Random Variables 


A Bernoulli random variable takes on only two values: 0 and 1, with probabilities
$1 - p$ and $p$, respectively. Its frequency function is thus

$$p(1) = p$$
$$p(0) = 1 - p$$
$$p(x) = 0, \quad \text{if } x \ne 0 \text{ and } x \ne 1$$

An alternative and sometimes useful representation of this function is

$$p(x) = \begin{cases} p^x (1 - p)^{1 - x}, & \text{if } x = 0 \text{ or } x = 1 \\ 0, & \text{otherwise} \end{cases}$$

If $A$ is an event, then the indicator random variable, $I_A$, takes on the value 1 if
$A$ occurs and the value 0 if $A$ does not occur:

$$I_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A \\ 0, & \text{otherwise} \end{cases}$$

$I_A$ is a Bernoulli random variable. In applications, Bernoulli random variables often
occur as indicators. A Bernoulli random variable might take on the value 1 or 0
according to whether a guess was a success or a failure.


The Binomial Distribution 


Suppose that n independent experiments, or trials, are performed, where n is a fixed 
number, and that each experiment results in a "success" with probability p and a 
"failure" with probability 1 — p. The total number of successes, X, is a binomial 
random variable with parameters n and p. For example, a coin is tossed 10 times and
the total number of heads is counted ("head" is identified with "success").

The probability that $X = k$, or $p(k)$, can be found in the following way: Any
particular sequence of $k$ successes occurs with probability $p^k (1 - p)^{n-k}$, from the
multiplication principle. The total number of such sequences is $\binom{n}{k}$, since there are
$\binom{n}{k}$ ways to assign $k$ successes to $n$ trials. $P(X = k)$ is thus the probability of any
particular sequence times the number of such sequences:


$$p(k) = \binom{n}{k} p^k (1 - p)^{n-k}$$


Two binomial frequency functions are shown in Figure 2.3. Note how the shape varies
as a function of p.

FIGURE 2.3 Binomial frequency functions, (a) n = 10 and p = .1 and (b) n = 10
and p = .5.
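
The binomial frequency function is simple to tabulate. The sketch below evaluates it for the two parameter settings named in the Figure 2.3 caption; the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """p(k) = C(n, k) * p^k * (1 - p)^(n - k) for k = 0, 1, ..., n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10
for p in (0.1, 0.5):
    values = [binomial_pmf(k, n, p) for k in range(n + 1)]
    print(f"p = {p}:", [round(v, 3) for v in values])
    print("  sum of probabilities:", sum(values))
```

For p = .1 most of the probability sits at k = 0 and k = 1, whereas for p = .5 the function is symmetric about k = 5.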


Tay-Sachs disease is a rare but fatal disease of genetic origin occurring chiefly in
infants and children, especially those of Jewish or eastern European extraction. If a
couple are both carriers of Tay-Sachs disease, a child of theirs has probability .25 of
being born with the disease. If such a couple has four children, what is the frequency
function for the number of children who will have the disease?

We assume that the four outcomes are independent of each other, so, if $X$ denotes
the number of children with the disease, its frequency function is


$$p(k) = \binom{4}{k} .25^k \times .75^{4-k}, \qquad k = 0, 1, 2, 3, 4$$


These probabilities are given in the following table:


k    p(k)
0    .316
1    .422
2    .211
3    .047
4    .004


EXAMPLE B

If a single bit (0 or 1) is transmitted over a noisy communications channel, it has
probability p of being incorrectly transmitted. To improve the reliability of the trans- 
mission, the bit is transmitted n times, where n is odd. A decoder at the receiving 
end, called a majority decoder, decides that the correct message is that carried by a 
majority of the received bits. Under a simple noise model, each bit is independently 
subject to being corrupted with the same probability $p$. The number of bits that are
in error, $X$, is thus a binomial random variable with $n$ trials and probability $p$ of
success on each trial (in this case, and frequently elsewhere, the word success is used
in a generic sense; here a success is an error). Suppose, for example, that $n = 5$ and
$p = .1$. The probability that the message is correctly received is the probability of
two or fewer errors, which is

$$\sum_{k=0}^{2} \binom{n}{k} p^k (1 - p)^{n-k} = (1 - p)^5 + 5p(1 - p)^4 + 10p^2(1 - p)^3 = .9914$$


The result is a considerable improvement in reliability. ■
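
The majority-decoder calculation extends to any odd n. The following is a sketch under the same noise model (each bit corrupted independently with probability p); the function name `prob_correct` is illustrative.

```python
from math import comb


def prob_correct(n, p):
    """Probability that a majority decoder recovers the bit from n copies.

    The decoder is correct when at most (n - 1) // 2 bits are in error
    (n is assumed to be odd).
    """
    max_errors = (n - 1) // 2
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(max_errors + 1))


p = 0.1
for n in (1, 3, 5, 7):
    print(n, round(prob_correct(n, p), 4))
```

With n = 1 the function returns 1 - p, the reliability of a single transmission, and with n = 5 and p = .1 it reproduces the value .9914 found above.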


DNA Matching 

We continue Example G of Section 1.6. There we derived the probability p that two 
fragments agree at a particular site under the assumption that the nucleotide proba- 
bilities were the same at every site and the identities on fragment 1 were independent 
of those on fragment 2. To find the probability of the total number of matches, further 
assumptions must be made. Suppose that the fragments are each of length n and that 
the nucleotide identities are independent from site to site as well as between frag- 
ments. Thus, the identity of the nucleotide at site 1 of fragment 1 is independent of the 
identity at site 2, etc. We did not make this assumption in Example G in Section 1.6; 
in that case, the identity at site 2 could have depended on the identity at site 1, for 
example. Now, under the current assumption, the two fragments agree at each site 
with probability $p$ as calculated in Example G of Section 1.6, and agreement is independent
from site to site. So, the total number of agreements is a binomial random
variable with $n$ trials and probability $p$ of success. ■


A random variable with a binomial distribution can be expressed in terms of independent
Bernoulli random variables, a fact that will be quite useful for analyzing some
properties of binomial random variables in later chapters of this book. Specifically,
let $X_1, X_2, \ldots, X_n$ be independent Bernoulli random variables with $P(X_i = 1) = p$.
Then $Y = X_1 + X_2 + \cdots + X_n$ is a binomial random variable.
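
This representation can be checked numerically by building the distribution of the sum one Bernoulli variable at a time. The sketch below is an illustration, not a proof; the function names are hypothetical, and the values n = 10 and p = .3 are chosen arbitrarily.

```python
from math import comb


def add_bernoulli(dist, p):
    """Distribution of S + X, where dist[s] = P(S = s) and X is an
    independent Bernoulli(p) random variable."""
    new = [0.0] * (len(dist) + 1)
    for s, prob in enumerate(dist):
        new[s] += prob * (1 - p)   # X = 0
        new[s + 1] += prob * p     # X = 1
    return new


def sum_of_bernoullis(n, p):
    """Distribution of X_1 + ... + X_n for independent Bernoulli(p) terms."""
    dist = [1.0]  # the empty sum equals 0 with probability 1
    for _ in range(n):
        dist = add_bernoulli(dist, p)
    return dist


n, p = 10, 0.3
dist = sum_of_bernoullis(n, p)
for k in range(n + 1):
    print(k, round(dist[k], 6), round(comb(n, k) * p**k * (1 - p) ** (n - k), 6))
```

The two printed columns, the distribution of the sum and the binomial frequency function, should agree up to rounding.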


The Geometric and Negative Binomial Distributions 


The geometric distribution is also constructed from independent Bernoulli trials, 
but from an infinite sequence. On each trial, a success occurs with probability p, and 
X is the total number of trials up to and including the first success. So that X = k, 
there must be k — 1 failures followed by a success. From the independence of the 
trials, this occurs with probability 


$$p(k) = P(X = k) = (1-p)^{k-1} p, \qquad k = 1, 2, 3, \ldots$$


Note that these probabilities sum to 1: 


$$\sum_{k=1}^{\infty} (1-p)^{k-1} p = p \sum_{j=0}^{\infty} (1-p)^j = 1$$
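As a quick numerical illustration of this geometric series, the sketch below evaluates the frequency function and a long partial sum. The value p = 0.2 is an arbitrary choice for illustration, not one taken from the text, and the function name is ours.

```python
def geometric_pmf(k, p):
    """P(X = k): k - 1 failures followed by one success."""
    return (1 - p) ** (k - 1) * p

p = 0.2  # illustrative value only
# The partial sum over k = 1..K equals 1 - (1 - p)**K, which tends to 1 as K grows.
K = 200
partial = sum(geometric_pmf(k, p) for k in range(1, K + 1))
print(partial, 1 - (1 - p) ** K)
```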


The probability of winning in a certain state lottery is said to be about 1/9. If it is exactly
1/9, the distribution of the number of tickets a person must purchase up to and including
the first winning ticket is a geometric random variable with p = 1/9. Figure 2.4 shows
the frequency function. ■


FIGURE 2.4 The probability mass function of a geometric random variable with
p = 1/9.
The negative binomial distribution arises as a generalization of the geometric
distribution. Suppose that a sequence of independent trials, each with probability of
success p, is performed until there are r successes in all; let X denote the total number
of trials. To find P(X = k), we can argue in the following way: Any particular such
sequence has probability $p^r (1-p)^{k-r}$, from the independence assumption. The last
trial is a success, and the remaining r − 1 successes can be assigned to the remaining
k − 1 trials in $\binom{k-1}{r-1}$ ways. Thus,


$$P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}$$
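The negative binomial frequency function, P(X = k) = C(k − 1, r − 1) p^r (1 − p)^(k − r), can be checked in two simple ways: with r = 1 it should reduce to the geometric frequency function, and its values should sum to 1. The sketch below does both; p = 0.3 is an arbitrary illustrative value and the function name is ours.

```python
from math import comb

def neg_binomial_pmf(k, r, p):
    """P(X = k): the r-th success occurs on trial k (zero when k < r)."""
    if k < r:
        return 0.0
    return comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)

p = 0.3  # illustrative value only

# With r = 1 the formula reduces to the geometric frequency function.
geometric = [(1 - p) ** (k - 1) * p for k in range(1, 6)]
nb_r1 = [neg_binomial_pmf(k, 1, p) for k in range(1, 6)]

# The probabilities for r = 3 should sum to (essentially) 1.
total = sum(neg_binomial_pmf(k, 3, p) for k in range(3, 400))
print(total)
```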


It is sometimes helpful in analyzing properties of the negative binomial distribu- 
tion to note that a negative binomial random variable can be expressed as the sum of 
r independent geometric random variables: the number of trials up to and including 
the first success plus the number of trials after the first success up to and including 
the second success, ..., plus the number of trials from the (r − 1)st success up to and
including the rth success. 


Continuing Example A, the distribution of the number of tickets purchased up to and 
including the second winning ticket is negative binomial: 


$$p(k) = (k-1)\, p^2 (1-p)^{k-2}$$

This frequency function is shown in Figure 2.5. ■

FIGURE 2.5 The probability mass function of a negative binomial random variable
with p = 1/9 and r = 2.


The definitions of the geometric and negative binomial distributions vary slightly 
from one textbook to another; for example, instead of X being the total number of 
trials in the definition of the geometric distribution, X is sometimes defined as the 
total number of failures.

2.1.4 The Hypergeometric Distribution


The hypergeometric distribution was introduced in Chapter 1 but was not named
there. Suppose that an urn contains n balls, of which r are black and n − r are white. Let
X denote the number of black balls drawn when taking m balls without replacement.
Following the line of reasoning of Examples H and I of Section 1.4.2,

$$P(X = k) = \frac{\binom{r}{k}\binom{n-r}{m-k}}{\binom{n}{m}}$$

X is a hypergeometric random variable with parameters r, n, and m.


EXAMPLE A  As explained in Example G of Section 1.4.2, a player in the California lottery chooses
6 numbers from 53 and the lottery officials later choose 6 numbers at random. Let X
equal the number of matches. Then

$$P(X = k) = \frac{\binom{6}{k}\binom{47}{6-k}}{\binom{53}{6}}$$

The probability mass function of X is displayed in the following table:

k       0       1       2       3       4               5               6
p(k)    .468    .401    .117    .014    7.06 × 10^-4    1.22 × 10^-5    4.36 × 10^-8
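The lottery table can be recomputed from the hypergeometric frequency function P(X = k) = C(r, k) C(n − r, m − k) / C(n, m). The following sketch uses n = 53, r = 6, and m = 6 as in the example; the function name `hypergeom_pmf` is ours.

```python
from math import comb

def hypergeom_pmf(k, n, r, m):
    """P(X = k): k black balls in m draws without replacement
    from an urn of n balls, r of which are black."""
    return comb(r, k) * comb(n - r, m - k) / comb(n, m)

# 53 numbers in all, 6 chosen by the player, 6 drawn by the lottery officials.
table = {k: hypergeom_pmf(k, 53, 6, 6) for k in range(7)}
for k, prob in table.items():
    print(k, f"{prob:.3g}")
```

The probabilities sum to 1, and the chance of matching all six numbers is 1 in C(53, 6) = 22,957,480.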





2.1.5 The Poisson Distribution 


The Poisson frequency function with parameter λ (λ > 0) is

$$P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \qquad k = 0, 1, 2, \ldots$$


Since $e^{\lambda} = \sum_{k=0}^{\infty} \lambda^k / k!$, it follows that the frequency function sums to 1. Figure 2.6
shows four Poisson frequency functions. Note how the shape varies as a function of λ.
The Poisson distribution can be derived as the limit of a binomial distribution as
the number of trials, n, approaches infinity and the probability of success on each trial,
p, approaches zero in such a way that np = λ. The binomial frequency function is


$$p(k) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$

Setting np = λ, this expression becomes

$$p(k) = \frac{n!}{k!\,(n-k)!} \left(\frac{\lambda}{n}\right)^{k} \left(1 - \frac{\lambda}{n}\right)^{n-k}
      = \frac{\lambda^k}{k!}\, \frac{n!}{(n-k)!}\, \frac{1}{n^k} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k}$$

FIGURE 2.6 Poisson frequency functions, (a) λ = .1, (b) λ = 1, (c) λ = 5, (d) λ = 10.





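As n → ∞,

$$\frac{\lambda}{n} \to 0, \qquad \frac{n!}{(n-k)!\, n^k} \to 1, \qquad \left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}$$

and

$$\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$$

We thus have

$$p(k) \to \frac{\lambda^k e^{-\lambda}}{k!}$$

which is the Poisson frequency function.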


Two dice are rolled 100 times, and the number of double sixes, X, is counted. The
distribution of X is binomial with n = 100 and p = 1/36 = .0278. Since n is large and p
is small, we can approximate the binomial probabilities by Poisson probabilities with
λ = np = 2.78. The exact binomial probabilities and the Poisson approximations are
shown in the following table: 











        Binomial       Poisson
k       Probability    Approximation
0       .0596          .0620
1       .1705          .1725
2       .2414          .2397
3       .2255          .2221
4       .1564          .1544
5       .0858          .0858
6       .0389          .0398
7       .0149          .0158
8       .0050          .0055
9       .0015          .0017
10      .0004          .0005
11      .0001          .0001

The approximation is quite good. ■
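A comparison of this kind takes only a few lines to produce. The sketch below computes both columns for n = 100 and p = 1/36; the function names are ours, and because the code uses λ = 100/36 without rounding, its output may differ from the printed table in the last digit or two.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events with parameter lam."""
    return lam**k * exp(-lam) / factorial(k)

n, p = 100, 1 / 36
lam = n * p  # about 2.78

print(" k  binomial  poisson")
for k in range(12):
    print(f"{k:2d}  {binomial_pmf(k, n, p):.4f}    {poisson_pmf(k, lam):.4f}")
```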


The Poisson frequency function can be used to approximate binomial probabil- 
ities for large n and small p. This suggests how Poisson distributions can arise in 
practice. Suppose that X is a random variable that equals the number of times some 
event occurs in a given interval of time. Heuristically, let us think of dividing the 
interval into a very large number of small subintervals of equal length, and let us 
assume that the subintervals are so small that the probability of more than one event 
in a subinterval is negligible relative to the probability of one event, which is itself 
very small. Let us also assume that the probability of an event is the same in each 
subinterval and that whether an event occurs in one subinterval is independent of what 
happens in the other subintervals. The random variable X is thus nearly a binomial 
random variable, with the subintervals constituting the trials, and, from the limiting
result above, X has nearly a Poisson distribution. 

The preceding argument is not formal, of course, but merely suggestive. But, in 
fact, it can be made rigorous. The important assumptions underlying it are (1) what
happens in one subinterval is independent of what happens in any other subinterval,
(2) the probability of an event is the same in each subinterval, and (3) events do not 
happen simultaneously. The same kind of argument can be made if we are concerned 
with an area or a volume of space rather than with an interval on the real line. 

The Poisson distribution is of fundamental theoretical and practical importance. 
It has been used in many areas, including the following: 


* The Poisson distribution has been used in the analysis of telephone systems. The 
number of calls coming into an exchange during a unit of time might be modeled 
as a Poisson variable if the exchange services a large number of customers who act 
more or less independently. 

* One of the earliest uses of the Poisson distribution was to model the number of 
alpha particles emitted from a radioactive source during a given period of time. 

* The Poisson distribution has been used as a model by insurance companies. For 
example, the number of freak accidents, such as falls in the shower, for a large population
of people in a given time period might be modeled as a Poisson distribution,
because the accidents would presumably be rare and independent (provided there 
was only one person in the shower). 

* The Poisson distribution has been used by traffic engineers as a model for light 
traffic. The number of vehicles that pass a marker on a roadway during a unit of 
time can be counted. If traffic is light, the individual vehicles act independently 
of each other. In heavy traffic, however, one vehicle's movement may influence 
another's, so the approximation might not be good. 


This amusing classical example is from von Bortkiewicz (1898). The number of 
fatalities that resulted from being kicked by a horse was recorded for 10 corps of 
Prussian cavalry over a period of 20 years, giving 200 corps-years' worth of data.
These data and the probabilities from a Poisson model with λ = .61 are displayed
in the following table. The first column of the table gives the number of deaths per
year, ranging from 0 to 4. The second column lists how many times that number of
deaths was observed. Thus, for example, in 65 of the 200 corps-years, there was one
death. In the third column of the table, the observed numbers are converted to relative
frequencies by dividing them by 200. The fourth column of the table gives Poisson
probabilities with the parameter λ = .61. In Chapters 8 and 9, we discuss how to
choose a parameter value to fit a theoretical probability model to observed frequencies
and methods for testing goodness of fit. For now, we will just remark that the value
λ = .61 was chosen to match the average number of deaths per year.








Number of Deaths                Relative     Poisson
per Year            Observed    Frequency    Probability
0                   109         .545         .543
1                   65          .325         .331
2                   22          .110         .101
3                   3           .015         .021
4                   1           .005         .003    ■
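The table can be reproduced directly from the observed counts. The sketch below computes the average number of deaths per corps-year, which gives λ = .61, and then the relative frequencies and Poisson probabilities; the variable names are ours.

```python
from math import exp, factorial

# Number of deaths per corps-year -> number of corps-years in which it was observed.
observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}

total = sum(observed.values())                          # 200 corps-years
lam = sum(k * c for k, c in observed.items()) / total   # average deaths per corps-year

rel_freq = {k: c / total for k, c in observed.items()}
poisson = {k: lam**k * exp(-lam) / factorial(k) for k in observed}

for k in observed:
    print(k, observed[k], f"{rel_freq[k]:.3f}", f"{poisson[k]:.3f}")
```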
The Poisson distribution often arises from a model called a Poisson process for
the distribution of random events in a set S, which is typically one-, two-, or three-dimensional,
corresponding to time, a plane, or a volume of space. Basically, this
model states that if $S_1, S_2, \ldots, S_n$ are disjoint subsets of S, then the numbers of events
in these subsets, $N_1, N_2, \ldots, N_n$, are independent random variables that follow Poisson
distributions with parameters $\lambda|S_1|, \lambda|S_2|, \ldots, \lambda|S_n|$, where $|S_i|$ denotes the measure
of $S_i$ (length, area, or volume, for example). The crucial assumptions here are that
events in disjoint subsets are independent of each other and that the Poisson parameter
for a subset is proportional to the subset's size. Later, we will see that this latter assumption
implies that the average number of events in a subset is proportional to its size.


Suppose that an office receives telephone calls as a Poisson process with λ = .5
per min. The number of calls in a 5-min. interval follows a Poisson distribution with
parameter ω = 5λ = 2.5. Thus, the probability of no calls in a 5-min. interval is
$e^{-2.5} = .082$. The probability of exactly one call is $2.5e^{-2.5} = .205$. ■


Figure 2.7 shows four realizations of a Poisson process with λ = 25 in the unit square,
0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. It is interesting that the eye tends to perceive patterns, such
as clusters of points and large blank spaces. But by the nature of a Poisson process,
the locations of the points have no relationship to one another, and these patterns are
entirely a result of chance. ■

FIGURE 2.7 Four realizations of a Poisson process with λ = 25.
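Realizations like these are easy to generate. The sketch below is one possible approach, not necessarily the one used to draw the figure. It relies on two standard facts that are assumptions here: the total number of points in the unit square is Poisson with parameter λ, and, given that number, the points are independently and uniformly distributed over the square. The Poisson count is drawn by counting unit-rate exponential waiting times that fit into an interval of length λ. All function names are ours.

```python
import random

def poisson_sample(lam, rng):
    """Draw a Poisson(lam) count by counting arrivals of a unit-rate
    process (exponential gaps with mean 1) in an interval of length lam."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def poisson_process_unit_square(lam, rng):
    """One realization: a Poisson number of points placed uniformly in the square."""
    n_points = poisson_sample(lam, rng)
    return [(rng.random(), rng.random()) for _ in range(n_points)]

rng = random.Random(0)  # fixed seed so the run is reproducible
points = poisson_process_unit_square(25, rng)
print(len(points), "points; first few:", points[:3])
```

Plotting several such realizations side by side shows the apparent clusters and gaps described above, even though the points are placed independently.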


2.2 Continuous Random Variables


In applications, we are often interested in random variables that can take on a contin- 
uum of values rather than a finite or countably infinite number. For example, a model 
for the lifetime of an electronic component might be that it is random and can be 
any positive real number. For a continuous random variable, the role of the frequency 
function is taken by a density function, f(x), which has the properties that f(x) ≥ 0,
f is piecewise continuous, and $\int_{-\infty}^{\infty} f(x)\,dx = 1$. If X is a random variable with
a density function f, then for any a < b, the probability that X falls in the interval
(a, b) is the area under the density function between a and b:

$$P(a < X < b) = \int_a^b f(x)\,dx$$


A uniform random variable on the interval [0, 1] is a model for what we mean when 
we say “choose a number at random between 0 and 1.” Any real number in the interval 
is a possible outcome, and the probability model should have the property that the 
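probability that X is in any subinterval of length h is equal to h. The following density
function does the job:

$$f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & x < 0 \text{ or } x > 1 \end{cases}$$

This is called the uniform density on [0, 1]. The uniform density on a general interval
[a, b] is

$$f(x) = \begin{cases} 1/(b-a), & a \le x \le b \\ 0, & x < a \text{ or } x > b \end{cases}$$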


One consequence of this definition is that the probability that a continuous random 
variable X takes on any particular value is 0: 


$$P(X = c) = \int_c^c f(x)\,dx = 0$$


Although this may seem strange initially, it is really quite natural. If the uniform 
random variable of Example A had a positive probability of being any particular 
number, it should have the same probability for any number in [0, 1], in which case 
the sum of the probabilities of any countably infinite subset of [0, 1] (for example, 
the rational numbers) would be infinite. If X is a continuous random variable, then 


$$P(a < X < b) = P(a \le X < b) = P(a < X \le b)$$


Note that this is not true for a discrete random variable.

For small δ, if f is continuous at x,

$$P\left(x - \frac{\delta}{2} \le X \le x + \frac{\delta}{2}\right) = \int_{x-\delta/2}^{x+\delta/2} f(u)\,du \approx \delta f(x)$$

Therefore, the probability of a small interval around x is proportional to f(x). It is
sometimes useful to employ differential notation: P(x ≤ X ≤ x + dx) = f(x) dx.
The cumulative distribution function of a continuous random variable X is defined 
in the same way as for a discrete random variable: 


$$F(x) = P(X \le x)$$

F(x) can be expressed in terms of the density function:

$$F(x) = \int_{-\infty}^{x} f(u)\,du$$

From the fundamental theorem of calculus, if f is continuous at x, f(x) = F'(x).
The cdf can be used to evaluate the probability that X falls in an interval:

$$P(a \le X \le b) = \int_a^b f(x)\,dx = F(b) - F(a)$$

From this definition, we see that the cdf of a uniform random variable on [0, 1]
(Example A) is

$$F(x) = \begin{cases} 0, & x \le 0 \\ x, & 0 \le x \le 1 \\ 1, & x \ge 1 \end{cases}$$ ■


Suppose that F is the cdf of a continuous random variable and is strictly increasing
on some interval I, and that F = 0 to the left of I and F = 1 to the right of I; I
may be unbounded. Under this assumption, the inverse function $F^{-1}$ is well defined;
$x = F^{-1}(y)$ if $y = F(x)$. The pth quantile of the distribution F is defined to be that
value $x_p$ such that $F(x_p) = p$, or $P(X \le x_p) = p$. Under the assumption stated
above, $x_p$ is uniquely defined as $x_p = F^{-1}(p)$; see Figure 2.8. Special cases are
p = 1/2, which corresponds to the median of F; and p = 1/4 and p = 3/4, which
correspond to the lower and upper quartiles of F.


Suppose that $F(x) = x^2$ for 0 ≤ x ≤ 1. This statement is shorthand for the more
explicit statement

$$F(x) = \begin{cases} 0, & x \le 0 \\ x^2, & 0 \le x \le 1 \\ 1, & x \ge 1 \end{cases}$$

To find $F^{-1}$, we solve $y = F(x) = x^2$ for x, obtaining $x = F^{-1}(y) = \sqrt{y}$. The
median is $F^{-1}(.5) = .707$, the lower quartile is $F^{-1}(.25) = .50$, and the upper
quartile is $F^{-1}(.75) = .866$. ■
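When the inverse of a cdf is not available in closed form, a quantile can still be found numerically, because a strictly increasing F crosses the level p exactly once. The sketch below compares the closed-form quantiles √p for the cdf F(x) = x² on [0, 1] with a simple bisection search; the function names are ours.

```python
from math import sqrt

def F(x):
    """The cdf of the example: 0 to the left of [0, 1], x**2 inside, 1 to the right."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return x * x

def quantile_bisect(cdf, p, lo, hi, tol=1e-10):
    """Solve cdf(x) = p on [lo, hi] by bisection; assumes cdf is
    continuous and strictly increasing there with cdf(lo) <= p <= cdf(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for p in (0.25, 0.5, 0.75):
    print(p, round(sqrt(p), 3), round(quantile_bisect(F, p, 0.0, 1.0), 3))
```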
FIGURE 2.8 A cdf F and $F^{-1}$.


Value at Risk 

Financial firms need to quantify and monitor the risk of their investments. Value at 
Risk (VaR) is a widely used measure of potential losses. It involves two parameters: 
a time horizon and a level of confidence. For example, if the VaR of an institution is 
$10 million with a one-day horizon and a level of confidence of 95%, the interpretation 
is that there is a 5% chance of losses exceeding $10 million. Such a loss should be 
anticipated about once in 20 days. 

To see how VaR is computed, suppose the current value of the investment is $V_0$
and the future value is $V_1$. The return on the investment is $R = (V_1 - V_0)/V_0$, which
is modeled as a continuous random variable with cdf $F_R(r)$. Let the desired level of
confidence be denoted by 1 − α. We want to find $v^*$, the VaR. Then

$$\alpha = P(V_0 - V_1 \ge v^*) = P\left(\frac{V_1 - V_0}{V_0} \le -\frac{v^*}{V_0}\right) = F_R\left(-\frac{v^*}{V_0}\right)$$

Thus, $-v^*/V_0$ is the α quantile, $r_\alpha$, and $v^* = -V_0 r_\alpha$. The VaR is minus the current
value times the α quantile of the return distribution. ■
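The rule that the VaR equals minus the current value times the α quantile of the return is simple to apply once a return distribution has been chosen. The sketch below assumes, purely for illustration, that returns are normally distributed with mean 0 and standard deviation 0.02 and that the current value is 1,000,000; none of these numbers come from the text, and the function name is ours.

```python
from statistics import NormalDist

def value_at_risk(v0, alpha, return_dist):
    """VaR = -V0 * r_alpha, where r_alpha is the alpha quantile of the return."""
    return -v0 * return_dist.inv_cdf(alpha)

# Hypothetical inputs: normal returns with mean 0 and standard deviation 0.02.
returns = NormalDist(mu=0.0, sigma=0.02)
v0 = 1_000_000

var_95 = value_at_risk(v0, 0.05, returns)
print(round(var_95))
```

Any other continuous return distribution could be used in the same way, provided its α quantile can be computed.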


We next discuss some density functions that commonly arise in practice.

The exponential density function is

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$

Like the Poisson distribution, the exponential density depends on a single parameter,
λ > 0, and it would therefore be more accurate to refer to it as the family of exponential
densities. Several exponential densities are shown in Figure 2.9. Note that as
λ becomes larger, the density drops off more rapidly.

FIGURE 2.9 Exponential densities with λ = .5 (solid), λ = 1 (dotted), and λ = 2
(dashed).


The cumulative distribution function is easily found: 


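$$F(x) = \int_{-\infty}^{x} f(u)\,du = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$

The median of an exponential distribution, η, say, is readily found from the cdf. We
solve F(η) = 1/2:

$$1 - e^{-\lambda \eta} = \frac{1}{2}$$

from which we have

$$\eta = \frac{\log 2}{\lambda}$$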
The exponential distribution is often used to model lifetimes or waiting times, in
which context it is conventional to replace x by t. Suppose that we consider modeling
the lifetime of an electronic component as an exponential random variable, that the
component has lasted a length of time s, and that we wish to calculate the probability
that it will last at least t more time units; that is, we wish to find P(T > t + s | T > s):

$$\begin{aligned}
P(T > t + s \mid T > s) &= \frac{P(T > t + s \text{ and } T > s)}{P(T > s)} \\
&= \frac{P(T > t + s)}{P(T > s)} \\
&= \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} \\
&= e^{-\lambda t}
\end{aligned}$$


We see that the probability that the unit will last t more time units does not depend
on s. The exponential distribution is consequently said to be memoryless; it is clearly 
not a good model for human lifetimes, since the probability that a 16-year-old will 
live at least 10 more years is not the same as the probability that an 80-year-old 
will live at least 10 more years. It can be shown that the exponential distribution 
is characterized by this memoryless property—that is, the memorylessness implies 
that the distribution is exponential. It may be somewhat surprising that a qualitative 
characterization, the property of memorylessness, actually determines the form of 
this density function. 
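The memoryless property can be checked both from the formula P(T > t) = e^(−λt) and by simulation. In the sketch below the values λ = 0.5, s = 3, and t = 2 are arbitrary illustrative choices, and the function name is ours.

```python
import random
from math import exp

def survival(t, lam):
    """P(T > t) for an exponential random variable with parameter lam."""
    return exp(-lam * t)

lam, s, t = 0.5, 3.0, 2.0  # illustrative values only

# Exact check: P(T > t + s) / P(T > s) equals P(T > t).
conditional = survival(t + s, lam) / survival(s, lam)
print(conditional, survival(t, lam))

# Simulation check: among lifetimes that exceed s, the fraction that
# also exceed s + t should be close to P(T > t).
rng = random.Random(1)
draws = [rng.expovariate(lam) for _ in range(200_000)]
survivors = [x for x in draws if x > s]
empirical = sum(x > s + t for x in survivors) / len(survivors)
print(empirical)
```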

The memoryless character of the exponential distribution follows directly from 
its relation to a Poisson process. Suppose that events occur in time as a Poisson process 
with parameter λ and that an event occurs at time $t_0$. Let T denote the length of time
until the next event. The density of T can be found as follows:

$$P(T > t) = P(\text{no events in } (t_0, t_0 + t))$$

Since the number of events in the interval $(t_0, t_0 + t)$, which is of length t, follows a
Poisson distribution with parameter λt, this probability is $e^{-\lambda t}$, and thus T follows an
exponential distribution with parameter λ. We can continue in this fashion. Suppose
that the next event occurs at time $t_1$; the distribution of time until the third event is
again exponential by the same analysis and, from the independence property of the 
Poisson process, is independent of the length of time between the first two events. 
Generally, the times between events of a Poisson process are independent, identically 
distributed, exponential random variables. 

Proteins and other biologically important molecules are regulated in various 
ways. Some undergo aging and are thus more likely to degrade when they are old 
than when they are young. If a molecule was not subject to aging, but its chance 
of degradation was the same at any age, its lifetime would follow an exponential 
distribution. 


Muscle and nerve cell membranes contain large numbers of channels through which 
selected ions can pass when the channels are open. Using sophisticated experimental 
techniques, neurophysiologists can measure the resulting current that flows through 
a single channel, and experimental records often indicate that a channel opens and 
closes at seemingly random times. In some cases, simple kinetic models predict that 
the duration of the open time should be exponentially distributed.

FIGURE 2.10 Histograms of open times at varying concentrations of
suxamethonium and fitted exponential densities.

Marshall et al. (1990) studied the action of a channel-blocking agent (suxamethonium)
on a channel (the nicotinic receptor of frog muscle). Figure 2.10 displays
histograms of open times and fitted exponential distributions at a range of concentrations
of suxamethonium. In this example, the exponential distribution is parametrized
as f(t) = (1/τ) exp(−t/τ). τ is thus in units of time, whereas λ is in units of the
reciprocal of time. From the figure, we see that the intervals become shorter and that
the parameter τ decreases with increasing concentrations of the blocker. It can also
be seen, especially at higher concentrations, that very short intervals are not recorded
because of limitations of the instrumentation. ■
2.2.2 The Gamma Density

The gamma density function depends on two parameters, α and λ:

$$g(t) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, t^{\alpha - 1} e^{-\lambda t}, \qquad t \ge 0$$

For t < 0, g(t) = 0. So that the density be well defined and integrate to 1, α > 0 and
λ > 0. The gamma function, Γ(x), is defined as

$$\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u}\,du, \qquad x > 0$$


Some properties of the gamma function are developed in the problems at the end of 
this chapter. 

Note that if α = 1, the gamma density coincides with the exponential density.
The parameter α is called a shape parameter for the gamma density, and λ is called
a scale parameter. Varying α changes the shape of the density, whereas varying λ
corresponds to changing the units of measurement (say, from seconds to minutes) and
does not affect the shape of the density.
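These remarks can be illustrated numerically. The sketch below evaluates the gamma density g(t) = λ^α t^(α − 1) e^(−λt) / Γ(α) for t > 0 and checks three things: that α = 1 gives the exponential density, that changing λ only rescales the t axis, in the sense that g(t; α, λ) = λ g(λt; α, 1), and that the density integrates to approximately 1. The parameter values and the function name are our illustrative choices.

```python
from math import exp, gamma

def gamma_pdf(t, alpha, lam):
    """Gamma density for t > 0; returns 0 for t <= 0 (the single point
    t = 0 does not affect any probability)."""
    if t <= 0:
        return 0.0
    return lam**alpha / gamma(alpha) * t ** (alpha - 1) * exp(-lam * t)

# (1) alpha = 1 reproduces the exponential density lam * exp(-lam * t).
lam, t = 2.0, 0.7
exponential_value = lam * exp(-lam * t)
gamma_value = gamma_pdf(t, 1.0, lam)

# (2) lam acts only as a change of units: g(t; alpha, lam) = lam * g(lam * t; alpha, 1).
alpha = 2.5
lhs = gamma_pdf(t, alpha, 3.0)
rhs = 3.0 * gamma_pdf(3.0 * t, alpha, 1.0)

# (3) Midpoint-rule check that the density integrates to about 1.
h = 0.001
area = sum(gamma_pdf((i + 0.5) * h, alpha, 1.0) for i in range(60_000)) * h
print(gamma_value, exponential_value, lhs, rhs, area)
```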

Figure 2.11 shows several gamma densities. Gamma densities provide a fairly 
flexible class for modeling nonnegative random variables. 


FIGURE 2.11 Gamma densities, (a) α = .5 (solid) and α = 1 (dotted) and (b) α = 5
(solid) and α = 10 (dotted); λ = 1 in all cases.


The patterns of occurrence of earthquakes in terms of time, space, and magnitude 
are very erratic, and attempts are sometimes made to construct probabilistic models 
for these events. The models may be used in a purely descriptive manner or, more 
ambitiously, for purposes of predicting future occurrences and consequent damage. 
Figure 2.12 shows the fit of a gamma density and an exponential density to the 
observed times separating a sequence of small earthquakes (Udias and Rice, 1975). 
The gamma density clearly gives a better fit (α = .509 and λ = .00115). Note that an
exponential model for interoccurrence times would be memoryless; that is, knowing
that an earthquake had not occurred in the last t time units would tell us nothing
about the probability of occurrence during the next s time units. The gamma model
does not have this property. In fact, although we will not show this, the gamma model
with these parameter values has the character that there is a large likelihood that the
next earthquake will immediately follow any given one and this likelihood decreases
monotonically with time. ■

FIGURE 2.12 Fit of gamma density (triangles) and of exponential density (circles) to
times between microearthquakes.


2.2.3 The Normal Distribution


The normal distribution plays a central role in probability and statistics, for reasons 
that will become apparent in later chapters of this book. This distribution is also called 
the Gaussian distribution after Carl Friedrich Gauss, who proposed it as a model for 
measurement errors. The central limit theorem, which will be discussed in Chapter 6, 
justifies the use of the normal distribution in many applications. Roughly, the central 
limit theorem says that if a random variable is the sum of a large number of independent 
random variables, it is approximately normally distributed. The normal distribution 
has been used as a model for such diverse phenomena as a person’s height, the distribution
of IQ scores, and the velocity of a gas molecule. The density function of the normal
distribution depends on two parameters, μ and σ (where −∞ < μ < ∞, σ > 0):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}, \qquad -\infty < x < \infty$$

The parameters μ and σ are called the mean and standard deviation of the normal
density.

The cdf cannot be evaluated in closed form from this density function (the integral 
that defines the cdf cannot be evaluated by an explicit formula but must be found 
numerically). A problem at the end of this chapter asks you to show that the normal 
density just given integrates to one. 
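To illustrate what "found numerically" means here, the sketch below evaluates the normal cdf in two ways: by applying the trapezoidal rule to the density, and through the error function erf available in Python's math module, using the standard identity Φ(z) = (1 + erf(z/√2))/2. That identity is a known fact not stated in the text, and the function names are ours.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal cdf expressed through the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def trapezoid_cdf(x, mu=0.0, sigma=1.0, steps=20_000):
    """Approximate the cdf by the trapezoidal rule, starting 10 standard
    deviations below the mean (the mass further left is negligible).
    Intended for x above that starting point."""
    lo = mu - 10 * sigma
    h = (x - lo) / steps
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(x, mu, sigma))
    total += sum(normal_pdf(lo + i * h, mu, sigma) for i in range(1, steps))
    return total * h

print(normal_cdf(1.0), trapezoid_cdf(1.0))
```

Both approaches agree closely, and integrating far into the right tail gives a value close to 1, consistent with the density integrating to one.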

As shorthand for the statement “X follows a normal distribution with parameters
μ and σ,” it is convenient to use X ~ N(μ, σ²). From the form of the density function,
we see that the density is symmetric about μ, f(μ − x) = f(μ + x), where it has a
maximum, and that the rate at which it falls off is determined by σ. Figure 2.13 shows
several normal densities. Normal densities are sometimes referred to as bell-shaped
curves. The special case for which μ = 0 and σ = 1 is called the standard normal
density. Its cdf is denoted by Φ and its density by φ (not to be confused with the
empty set). The relationship between the general normal density and the standard
normal density will be developed in the next section.


FIGURE 2.13 Normal densities, μ = 0 and σ = .5 (solid), μ = 0 and σ = 1
(dotted), and μ = 0 and σ = 2 (dashed).


Acoustic recordings made in the ocean contain substantial background noise. To de- 
tect sonar signals of interest, it is useful to characterize this noise as accurately as 
possible. In the Arctic, much of the background noise is produced by the cracking 
and straining of ice. Veitch and Wilks (1985) studied recordings of Arctic undersea 
noise and characterized the noise as a mixture of a Gaussian component and occa- 
sional large-amplitude bursts. Figure 2.14 is a trace of one recording that includes a 
burst. Figure 2.15 shows a Gaussian distribution fit to observations from a “quiet” 
(nonbursty) period of this noise. ■

FIGURE 2.14 A record of undersea noise containing a large burst.

FIGURE 2.15 A histogram from a “quiet” period of undersea noise with a fitted
normal density.


Turbulent air flow is sometimes modeled as a random process. Since the velocity of 
the flow at any point is subject to the influence of a large number of random eddies 
in the neighborhood of that point, one might expect from the central limit theorem 
that the velocity would be normally distributed. Van Atta and Chen (1968) analyzed 
data gathered in a wind tunnel. Figure 2.16, taken from their paper, shows a normal 
distribution fit to 409,600 observations of one component of the velocity; the fit is 
remarkably good. ■


S&P 500 

The Standard and Poor’s 500 is an index of important U.S. stocks; each stock’s weight
in the index is proportional to its market value. Individuals can invest in mutual funds
that track the index. The top panel of Figure 2.17 shows the sequential values of the
returns during 2003. The average return during this period was 0.1% per day, and we
can see from the figure that daily fluctuations were as large as 3% or 4%. The lower
panel of the figure shows a histogram of the returns and a fitted normal density with
μ = 0.001 and σ = 0.01.

FIGURE 2.16 A normal density (solid line) fit to 409,600 measurements of one
component of the velocity of a turbulent wind flow. The dots show the values from a
histogram.

FIGURE 2.17 Returns on the S&P 500 during 2003 (top panel) and a normal curve
fitted to their histogram (bottom panel). The area to the left of the 0.05 quantile
is shaded.

A financial company could use the fitted normal density in calculating its Value at 
Risk (see Example D of Section 2.2). Using a time horizon of one day and a confidence 
level of 95%, the VaR is the current investment in the index, V₀, multiplied by the
negative of the 0.05 quantile of the distribution of returns. In this case, the quantile
can be calculated to be −0.0165, so the VaR is .0165V₀. Thus, if V₀ is $10 million,
the VaR is $165,000. The company can have 95% "confidence" that its losses will
not exceed that amount on a given day. However, it should not be surprised if that
amount is exceeded about once in every 20 trading days. ■


2.2.4 The Beta Density


The beta density is useful for modeling random variables that are restricted to the
interval [0, 1]:

$$f(u) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, u^{a-1} (1-u)^{b-1}, \qquad 0 \le u \le 1$$




Figure 2.18 shows beta densities for various values of a and b. Note that the case 
a = b = 1 is the uniform distribution. The beta distribution is important in Bayesian
statistics, as you will see later. 
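The beta density f(u) = Γ(a + b) u^(a − 1) (1 − u)^(b − 1) / (Γ(a) Γ(b)) is straightforward to evaluate with the gamma function in Python's math module. The sketch below checks that a = b = 1 gives the uniform density and that the density for a = 6, b = 2 integrates to approximately 1. The function name is ours, and for simplicity the code evaluates the density only on the open interval 0 < u < 1, which avoids the endpoints where the density can be infinite.

```python
from math import gamma

def beta_pdf(u, a, b):
    """Beta density on 0 < u < 1; returns 0 elsewhere (the endpoints,
    where the density may be infinite, carry no probability)."""
    if not 0 < u < 1:
        return 0.0
    const = gamma(a + b) / (gamma(a) * gamma(b))
    return const * u ** (a - 1) * (1 - u) ** (b - 1)

# a = b = 1 is the uniform density: the value is 1 everywhere inside (0, 1).
uniform_values = [beta_pdf(u, 1, 1) for u in (0.1, 0.5, 0.9)]

# Midpoint-rule check that the a = 6, b = 2 density integrates to about 1.
n = 10_000
area = sum(beta_pdf((i + 0.5) / n, 6, 2) for i in range(n)) / n
print(uniform_values, area)
```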


2.3 Functions of a Random Variable


Suppose that a random variable X has a density function f(x). We often need to find 
the density function of Y = g(X) for some given function g. For example, X might 
be the velocity of a particle of mass m, and we might be interested in the probability 
density function of the particle's kinetic energy, $Y = \frac{1}{2} m X^2$. Often, the density and
cdf of X are denoted by $f_X$ and $F_X$, and those of Y, by $f_Y$ and $F_Y$. To illustrate
techniques for solving such a problem, we first develop some useful facts about the 
normal distribution. 

Suppose that X ~ N(μ, σ²) and that Y = aX + b, where a > 0. The cumulative
distribution function of Y is

$$\begin{aligned}
F_Y(y) &= P(Y \le y) \\
&= P(aX + b \le y) \\
&= P\left(X \le \frac{y-b}{a}\right) \\
&= F_X\left(\frac{y-b}{a}\right)
\end{aligned}$$

FIGURE 2.18 Beta density functions for various values of a and b: (a) a = 2, b = 2; (b) a = 6, b = 2;
(c) a = 6, b = 6; and (d) a = .5, b = 4.

Thus,

$$f_Y(y) = \frac{d}{dy} F_X\left(\frac{y-b}{a}\right) = \frac{1}{a}\, f_X\left(\frac{y-b}{a}\right)$$

Up to this point, we have not used the assumption of normality at all, so this result
holds for a general continuous random variable, provided that $F_X$ is appropriately
differentiable. If $f_X$ is a normal density function with parameters μ and σ, we find
that, after substitution,

$$f_Y(y) = \frac{1}{a\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y - b - a\mu}{a\sigma}\right)^2\right]$$

From this, we see that Y follows a normal distribution with parameters aμ + b
and aσ.

The case for which a < 0 can be analyzed similarly (see Problem 57 in the 
end-of-chapter problems), yielding the following proposition. 








PROPOSITION A
If X ~ N(μ, σ²) and Y = aX + b, then Y ~ N(aμ + b, a²σ²). ■
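Proposition A can be checked numerically by comparing the density of Y = aX + b obtained from the change-of-variable formula with the normal density that the proposition predicts. The sketch below uses the form f_Y(y) = f_X((y − b)/a)/|a|, which covers a < 0 as well as a > 0; the absolute value is the standard result for the case a < 0 and is an addition here, since the text leaves that case to the problems. The values of μ, σ, a, and b are arbitrary, and the function name is ours.

```python
from statistics import NormalDist

def density_of_linear_transform(y, a, b, f_x):
    """Density of Y = aX + b at y, given the density f_x of X (a != 0)."""
    return f_x((y - b) / a) / abs(a)

mu, sigma = 1.0, 2.0   # illustrative parameters of X
a, b = -3.0, 4.0       # illustrative coefficients; a is negative on purpose

X = NormalDist(mu, sigma)
# Proposition A: Y ~ N(a*mu + b, a^2 * sigma^2), so its standard deviation is |a| * sigma.
Y = NormalDist(a * mu + b, abs(a) * sigma)

test_points = (-5.0, 0.0, 1.0, 2.5, 10.0)
differences = [
    abs(density_of_linear_transform(y, a, b, X.pdf) - Y.pdf(y)) for y in test_points
]
print(max(differences))
```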


This proposition is quite useful for finding probabilities from the normal distribution.
Suppose that X ~ N(μ, σ²) and we wish to find $P(x_0 < X < x_1)$ for
some numbers $x_0$ and $x_1$. Consider the random variable

$$Z = \frac{X - \mu}{\sigma} = \frac{X}{\sigma} - \frac{\mu}{\sigma}$$

Applying Proposition A with a = 1/σ and b = −μ/σ, we see that Z ~ N(0, 1),
that is, Z follows a standard normal distribution. Therefore,

$$F_X(x) = P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right)$$

We thus have

$$P(x_0 < X < x_1) = F_X(x_1) - F_X(x_0) = \Phi\left(\frac{x_1 - \mu}{\sigma}\right) - \Phi\left(\frac{x_0 - \mu}{\sigma}\right)$$

Thus, probabilities for general normal random variables can be evaluated in terms of
probabilities for standard normal random variables. This is quite useful, since tables
need to be made up only for the standard normal distribution rather than separately
for every μ and σ.





Scores on a certain standardized test, IQ scores, are approximately normally
distributed with mean $\mu = 100$ and standard deviation $\sigma = 15$. Here we are referring
to the distribution of scores over a very large population, and we approximate that
discrete cumulative distribution function by a normal continuous cumulative
distribution function. An individual is selected at random. What is the probability that his
score X satisfies 120 < X < 130?

We can calculate this probability by using the standard normal distribution as
follows:

$$P(120 < X < 130) = P\left(\frac{120-100}{15} < \frac{X-100}{15} < \frac{130-100}{15}\right) = P(1.33 < Z < 2)$$

where Z follows a standard normal distribution. Using a table of the standard normal
distribution (Table 2 of Appendix B), this probability is

$$P(1.33 < Z < 2) = \Phi(2) - \Phi(1.33) = .9772 - .9082 = .069$$


Thus, approximately 7% of the population will have scores in this range.

Let $X \sim N(\mu, \sigma^2)$, and find the probability that X is less than $\sigma$ away from $\mu$; that
is, find $P(|X - \mu| < \sigma)$.
This probability is

$$P(-\sigma < X - \mu < \sigma) = P\left(-1 < \frac{X-\mu}{\sigma} < 1\right) = P(-1 < Z < 1)$$

where Z follows a standard normal distribution. From tables of the standard normal
distribution, this last probability is

$$\Phi(1) - \Phi(-1) = .68$$

Thus, a normal random variable is within 1 standard deviation of its mean with
probability .68.
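
The standardization used in the two preceding examples can be sketched in a few lines of Python. This is only an illustration, not part of the text's method: the function names `normal_cdf` and `normal_interval_prob` are made up here, and the standard normal cdf is computed from the error function in place of Table 2.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cdf of N(mu, sigma^2), computed by standardizing: Phi((x - mu) / sigma)."""
    z = (x - mu) / sigma
    # Phi(z) expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_interval_prob(x0, x1, mu, sigma):
    """P(x0 < X < x1) = Phi((x1 - mu)/sigma) - Phi((x0 - mu)/sigma)."""
    return normal_cdf(x1, mu, sigma) - normal_cdf(x0, mu, sigma)


# IQ example: mu = 100, sigma = 15.
p_iq = normal_interval_prob(120, 130, mu=100, sigma=15)
# Within one standard deviation of the mean (any mu and sigma give the same answer).
p_one_sd = normal_interval_prob(-1, 1, mu=0, sigma=1)
print(p_iq, p_one_sd)
```

The first value comes out slightly below the table-based .069 because the code does not round (120 - 100)/15 to 1.33 before evaluating the cdf.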


We now turn to another example involving the normal distribution. 


Find the density of $X = Z^2$, where $Z \sim N(0, 1)$.
Here, we have

$$F_X(x) = P(X \le x) = P(-\sqrt{x} \le Z \le \sqrt{x}) = \Phi(\sqrt{x}) - \Phi(-\sqrt{x})$$

We find the density of X by differentiating the cdf. Since $\Phi'(x) = \phi(x)$, the chain
rule gives

$$f_X(x) = \tfrac{1}{2} x^{-1/2} \phi(\sqrt{x}) + \tfrac{1}{2} x^{-1/2} \phi(-\sqrt{x}) = x^{-1/2} \phi(\sqrt{x})$$

In the last step, we used the symmetry of $\phi$. Evaluating the last expression, we
find

$$f_X(x) = \frac{x^{-1/2}}{\sqrt{2\pi}} e^{-x/2}, \quad x \ge 0$$

We can recognize this as a gamma density by making use of a principle of general
utility. Suppose two densities are of the forms $k_1 h(x)$ and $k_2 h(x)$; then, because they
both integrate to 1, $k_1 = k_2$. Now, comparing the form of f(x) given here to that of
the gamma density with $\alpha = \lambda = 1/2$, we recognize by this reasoning that f(x) is
a gamma density and that $\Gamma(1/2) = \sqrt{\pi}$. This density is also called the chi-square
density with 1 degree of freedom.
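
As a numerical sanity check on this derivation, the following sketch differentiates the cdf $\Phi(\sqrt{x}) - \Phi(-\sqrt{x})$ by a central difference and compares the result with the closed-form density. The function names, the step size, and the evaluation points are arbitrary choices made for illustration.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def cdf_of_square(x):
    """F_X(x) = Phi(sqrt(x)) - Phi(-sqrt(x)) for X = Z^2 and x >= 0."""
    r = math.sqrt(x)
    return std_normal_cdf(r) - std_normal_cdf(-r)


def density_of_square(x):
    """Closed form from the text: x^(-1/2) exp(-x/2) / sqrt(2 pi), for x > 0."""
    return x ** -0.5 * math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)


def central_difference(f, x, h=1e-5):
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


points = (0.5, 1.0, 2.0, 4.0)
max_gap = max(
    abs(central_difference(cdf_of_square, x) - density_of_square(x)) for x in points
)
print(max_gap)  # expected to be tiny if the derivation is right

# The normalizing-constant argument gives Gamma(1/2) = sqrt(pi).
print(math.gamma(0.5), math.sqrt(math.pi))
```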


As another example, let us consider the following.

EXAMPLE D Let U be a uniform random variable on [0, 1], and let V = 1/U. To find the density
of V, we first find the cdf:

$$F_V(v) = P(V \le v) = P\left(\frac{1}{U} \le v\right) = P\left(U \ge \frac{1}{v}\right) = 1 - \frac{1}{v}$$

This expression is valid for $v \ge 1$; for v < 1, $F_V(v) = 0$. We can now find the density
by differentiation:

$$f_V(v) = \frac{1}{v^2}, \quad 1 \le v < \infty$$


Looking back over these examples, we see that we have gone through the same 
basic steps in each case: first finding the cdf of the transformed variable, then dif- 
ferentiating to find the density, and then specifying in what region the result holds. 
These same steps can be used to prove the following general result. 


PROPOSITION B

Let X be a continuous random variable with density f(x) and let Y = g(X) where
g is a differentiable, strictly monotonic function on some interval I. Suppose that
f(x) = 0 if x is not in I. Then Y has the density function

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|$$

for y such that y = g(x) for some x, and $f_Y(y) = 0$ if $y \ne g(x)$ for any x in I.
Here $g^{-1}$ is the inverse function of g; that is, $g^{-1}(y) = x$ if y = g(x).


For any specific problem, proceeding from scratch is usually easier than deci- 
phering the notation and applying the proposition. 
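
For readers who do want to apply Proposition B mechanically, here is a small sketch that evaluates the formula for Example D, V = 1/U. The helper `transformed_density` and its argument names are hypothetical; the caller supplies the inverse function and its derivative by hand, and is assumed to evaluate only at points where both are defined (here, v > 0).

```python
def transformed_density(f_x, g_inv, dg_inv_dy, y):
    """Density of Y = g(X) at y for strictly monotonic, differentiable g:
    f_X(g^{-1}(y)) * |d/dy g^{-1}(y)|."""
    return f_x(g_inv(y)) * abs(dg_inv_dy(y))


def uniform_density(u):
    """Density of U, uniform on [0, 1]; zero outside the interval."""
    return 1.0 if 0.0 <= u <= 1.0 else 0.0


def f_v(v):
    """Density of V = 1/U, so g(u) = 1/u, g^{-1}(v) = 1/v, d/dv g^{-1}(v) = -1/v^2."""
    return transformed_density(
        uniform_density,
        lambda y: 1.0 / y,
        lambda y: -1.0 / y ** 2,
        v,
    )


print(f_v(2.0))  # agrees with 1/v^2 at v = 2
print(f_v(0.5))  # v < 1 maps outside [0, 1], so the density is 0
```

Because `uniform_density` vanishes outside [0, 1], the region where the result holds is handled automatically, which is the third step singled out in the summary of the examples.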

We conclude this section by developing some results relating the uniform
distribution to other continuous distributions. Throughout, we consider a random
variable X, with density f and cdf F, where F is strictly increasing on some interval
I, F = 0 to the left of I, and F = 1 to the right of I. I may be a bounded interval
or an unbounded interval such as the whole real line. $F^{-1}(x)$ is then well defined
for 0 < x < 1.

PROPOSITION C
Let Z = F(X); then Z has a uniform distribution on [0, 1].


Proof

$$P(Z \le z) = P(F(X) \le z) = P(X \le F^{-1}(z)) = F(F^{-1}(z)) = z$$

This is the uniform cdf.


PROPOSITION D
Let U be uniform on [0, 1], and let $X = F^{-1}(U)$. Then the cdf of X is F.


Proof

$$P(X \le x) = P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$$


This last proposition is quite useful in generating pseudorandom numbers with 
a given cdf F. Many computer packages have routines for generating pseudorandom 
numbers that are uniform on [0, 1]. These numbers are called pseudorandom because 
they are generated according to some rule or algorithm and thus are not “really” 
random. Proposition D tells us that to generate random variables with cdf F, we just
apply $F^{-1}$ to uniform random numbers. This is quite practical as long as $F^{-1}$ can be
calculated easily.


Suppose that, as part of a simulation study, we want to generate random variables from
an exponential distribution. For example, the performance of large queueing networks
is often assessed by simulation. One aspect of such a simulation involves generating
random time intervals between customer arrivals, which might be assumed to be
exponentially distributed. If we have access to a uniform random number generator,
we can apply Proposition D to generate exponential random numbers. The cdf is
$F(t) = 1 - e^{-\lambda t}$. $F^{-1}$ can be found by solving $x = 1 - e^{-\lambda t}$ for t:

$$e^{-\lambda t} = 1 - x$$
$$-\lambda t = \log(1 - x)$$
$$t = -\log(1 - x)/\lambda$$

Thus, if U is uniform on [0, 1], then $T = -\log(1 - U)/\lambda$ is an exponential random
variable with parameter $\lambda$. This can be simplified slightly by noting that V = 1 - U
is also uniform on [0, 1] since

$$P(V \le v) = P(1 - U \le v) = P(U \ge 1 - v) = 1 - (1 - v) = v$$

We may thus take $T = -\log(V)/\lambda$, where V is uniform on [0, 1].
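
A minimal sketch of this recipe, assuming Python's standard `random` module as the uniform generator. The function names, the seed, the rate $\lambda = 2$, and the sample size are all illustrative choices, not values from the text.

```python
import math
import random


def exponential_inverse_cdf(x, lam):
    """F^{-1}(x) = -log(1 - x) / lam for the exponential cdf F(t) = 1 - exp(-lam * t)."""
    return -math.log(1.0 - x) / lam


def sample_exponential(lam, n, rng):
    """Apply F^{-1} to n uniform pseudorandom numbers (Proposition D)."""
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the log is finite.
    return [exponential_inverse_cdf(rng.random(), lam) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is reproducible
lam = 2.0
samples = sample_exponential(lam, 100_000, rng)

# Compare the empirical cdf at t = 1 with the exact value 1 - exp(-lam).
empirical = sum(t <= 1.0 for t in samples) / len(samples)
exact = 1.0 - math.exp(-lam * 1.0)
print(empirical, exact)
```

Using $-\log(1 - U)$ instead of $-\log(V)$ here is only a convenience: it avoids taking the logarithm of 0, which `random()` can return.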
2.4 Concluding Remarks


This chapter introduced the concept of a random variable, one of the fundamental 
ideas of probability theory. A fully rigorous discussion of random variables requires 
a background in measure theory. The development here is sufficient for the needs of 
this course. 

Discrete and continuous random variables have been defined, and it should be 
mentioned that more general random variables can also be defined and are useful on 
occasion. In particular, it makes sense to consider random variables that have both a 
discrete and a continuous component. For example, the lifetime of a transistor might 
be 0 with some probability p > 0 if it does not function at all; if it does function, the 
lifetime could be modeled as a continuous random variable. 


2.5 Problems 


1. Suppose that X is a discrete random variable with P(X = 0) = .25, P(X = 1) =
.125, P(X = 2) = .125, and P(X = 3) = .5. Graph the frequency function and
the cumulative distribution function of X.


2. An experiment consists of throwing a fair coin four times. Find the frequency 
function and the cumulative distribution function of the following random vari- 
ables: (a) the number of heads before the first tail, (b) the number of heads 
following the first tail, (c) the number of heads minus the number of tails, and 
(d) the number of tails times the number of heads. 


3. The following table shows the cumulative distribution function of a discrete 
random variable. Find the frequency function. 


k    F(k)
0    0
1    .1
2    .3
3    .7
4    .8
5    1.0


4. If X is an integer-valued random variable, show that the frequency function is 
related to the cdf by p(k) = F(k) — F(k — 1). 


5. Show that $P(u < X \le v) = F(v) - F(u)$ for any u and v in the cases that (a)
X is a discrete random variable and (b) X is a continuous random variable.

6. Let A and B be events, and let $I_A$ and $I_B$ be the associated indicator random
variables. Show that

$$I_{A \cap B} = I_A I_B = \min(I_A, I_B)$$

and

$$I_{A \cup B} = \max(I_A, I_B)$$


7. Find the cdf of a Bernoulli random variable.


8. Show that the binomial probabilities sum to 1.


9. For what values of p is a two-out-of-three majority decoder better than
transmission of the message once?


10. Appending three extra bits to a 4-bit word in a particular way (a Hamming code)
allows detection and correction of up to one error in any of the bits. If each
bit has probability .05 of being changed during communication, and the bits
are changed independently of each other, what is the probability that the word
is correctly received (that is, 0 or 1 bit is in error)? How does this probability
compare to the probability that the word will be transmitted correctly with no
check bits, in which case all four bits would have to be transmitted correctly for
the word to be correct?


11. Consider the binomial distribution with n trials and probability p of success on
each trial. For what value of k is P(X = k) maximized? This value is called the
mode of the distribution. (Hint: Consider the ratio of successive terms.)


12. Which is more likely: 9 heads in 10 tosses of a fair coin or 18 heads in 20 tosses?


13. A multiple-choice test consists of 20 items, each with four choices. A student is
able to eliminate one of the choices on each question as incorrect and chooses
randomly from the remaining three choices. A passing grade is 12 items or more
correct.

a. What is the probability that the student passes?
b. Answer the question in part (a) again, assuming that the student can eliminate
two of the choices on each question.


14. Two boys play basketball in the following way. They take turns shooting and
stop when a basket is made. Player A goes first and has probability $p_1$ of
making a basket on any throw. Player B, who shoots second, has probability $p_2$ of
making a basket. The outcomes of the successive trials are assumed to be
independent.

a. Find the frequency function for the total number of attempts.
b. What is the probability that player A wins?


15. Two teams, A and B, play a series of games. If team A has probability .4 of
winning each game, is it to its advantage to play the best three out of five games
or the best four out of seven? Assume the outcomes of successive games are
independent.


16. Show that if n approaches $\infty$ and r/n approaches p and m is fixed, the
hypergeometric frequency function tends to the binomial frequency function. Give a
heuristic argument for why this is true.


17. Suppose that in a sequence of independent Bernoulli trials, each with probability
of success p, the number of failures up to the first success is counted. What is
the frequency function for this random variable?


18. Continuing with Problem 17, find the frequency function for the number of
failures up to the rth success.


19. Find an expression for the cumulative distribution function of a geometric random
variable.


20. If X is a geometric random variable with p = .5, for what value of k is
$P(X \le k) \approx .99$?


21. If X is a geometric random variable, show that

$$P(X > n + k - 1 \mid X > n - 1) = P(X > k)$$

In light of the construction of a geometric distribution from a sequence of
independent Bernoulli trials, how can this be interpreted so that it is “obvious”?


22. Three identical fair coins are thrown simultaneously until all three show the same
face. What is the probability that they are thrown more than three times?


23. In a sequence of independent trials with probability p of success, what is the
probability that there are r successes before the kth failure?


24. (Banach Match Problem) A pipe smoker carries one box of matches in his left
pocket and one box in his right. Initially, each box contains n matches. If he
needs a match, the smoker is equally likely to choose either pocket. What is the
frequency function for the number of matches in the other box when he first
discovers that one box is empty?


25. The probability of being dealt a royal straight flush (ace, king, queen, jack, and
ten of the same suit) in poker is about $1.3 \times 10^{-8}$. Suppose that an avid poker
player sees 100 hands a week, 52 weeks a year, for 20 years.

a. What is the probability that she is never dealt a royal straight flush?
b. What is the probability that she is dealt exactly two royal straight flushes?


26. The university administration assures a mathematician that he has only 1 chance in
10,000 of being trapped in a much-maligned elevator in the mathematics building.
If he goes to work 5 days a week, 52 weeks a year, for 10 years, and always rides
the elevator up to his office when he first arrives, what is the probability that
he will never be trapped? That he will be trapped once? Twice? Assume that
the outcomes on all the days are mutually independent (a dubious assumption in
practice).


27. Suppose that a rare disease has an incidence of 1 in 1000. Assuming that members
of the population are affected independently, find the probability of k cases in a
population of 100,000 for k = 0, 1, 2.


28. Let $p_0, p_1, \ldots, p_n$ denote the probability mass function of the binomial
distribution with parameters n and p. Let q = 1 - p. Show that the binomial probabilities
can be computed recursively by $p_0 = q^n$ and

$$p_k = \frac{(n-k+1)p}{kq}\, p_{k-1}, \quad k = 1, 2, \ldots, n$$

Use this relation to find $P(X \le 4)$ for n = 9000 and p = .0005.


29. Show that the Poisson probabilities $p_0, p_1, \ldots$ can be computed recursively by
$p_0 = \exp(-\lambda)$ and

$$p_k = \frac{\lambda}{k}\, p_{k-1}, \quad k = 1, 2, \ldots$$

Use this scheme to find $P(X \le 4)$ for $\lambda = 4.5$ and compare to the results of
Problem 28.


30. Suppose that in a city, the number of suicides can be approximated by a Poisson
process with $\lambda = .33$ per month.

a. Find the probability of k suicides in a year for k = 0, 1, 2, .... What is the
most probable number of suicides?
b. What is the probability of two suicides in one week?


31. Phone calls are received at a certain residence as a Poisson process with parameter
$\lambda = 2$ per hour.

a. If Diane takes a 10-min. shower, what is the probability that the phone rings
during that time?
b. How long can her shower be if she wishes the probability of receiving no
phone calls to be at most .5?


32. For what value of k is the Poisson frequency function with parameter $\lambda$
maximized? (Hint: Consider the ratio of consecutive terms.)


33. Let $F(x) = 1 - \exp(-\alpha x^{\beta})$ for $x \ge 0$, $\alpha > 0$, $\beta > 0$, and F(x) = 0 for x < 0.
Show that F is a cdf, and find the corresponding density.


34. Let $f(x) = (1 + \alpha x)/2$ for $-1 \le x \le 1$ and f(x) = 0 otherwise, where
$-1 \le \alpha \le 1$. Show that f is a density, and find the corresponding cdf. Find the
quartiles and the median of the distribution in terms of $\alpha$.


35. Sketch the pdf and cdf of a random variable that is uniform on [-1, 1].


36. If U is a uniform random variable on [0, 1], what is the distribution of the random
variable X = [nU], where [t] denotes the greatest integer less than or equal to t?


37. A line segment of length 1 is cut once at random. What is the probability that the
longer piece is more than twice the length of the shorter piece?


38. If f and g are densities, show that $\alpha f + (1 - \alpha) g$ is a density, where $0 \le \alpha \le 1$.


39. The Cauchy cumulative distribution function is

$$F(x) = \frac{1}{2} + \frac{1}{\pi} \tan^{-1}(x), \quad -\infty < x < \infty$$

a. Show that this is a cdf.
b. Find the density function.
c. Find x such that P(X > x) = .1.


40. Suppose that X has the density function $f(x) = cx^2$ for $0 \le x \le 1$ and f(x) = 0
otherwise.

a. Find c. b. Find the cdf. c. What is P(.1 < X < .5)?


41. Find the upper and lower quartiles of the exponential distribution.


42. Find the probability density for the distance from an event to its nearest neighbor
for a Poisson process in the plane.


43. Find the probability density for the distance from an event to its nearest neighbor
for a Poisson process in three-dimensional space.


44. Let T be an exponential random variable with parameter $\lambda$. Let X be a discrete
random variable defined as X = k if $k \le T < k + 1$, k = 0, 1, .... Find the
frequency function of X.


45. Suppose that the lifetime of an electronic component follows an exponential
distribution with $\lambda = .1$.

a. Find the probability that the lifetime is less than 10.
b. Find the probability that the lifetime is between 5 and 15.
c. Find t such that the probability that the lifetime is greater than t is .01.


46. T is an exponential random variable, and P(T < 1) = .05. What is $\lambda$?


47. If $\alpha > 1$, show that the gamma density has a maximum at $(\alpha - 1)/\lambda$.


48. Show that the gamma density integrates to 1.


49. The gamma function is a generalized factorial function.

a. Show that $\Gamma(1) = 1$.
b. Show that $\Gamma(x + 1) = x\Gamma(x)$. (Hint: Use integration by parts.)
c. Conclude that $\Gamma(n) = (n - 1)!$, for n = 1, 2, 3, ....
d. Use the fact that $\Gamma(1/2) = \sqrt{\pi}$ to show that, if n is an odd integer,

$$\Gamma\left(\frac{n}{2}\right) = \frac{\sqrt{\pi}\,(n-1)!}{2^{n-1}\left(\frac{n-1}{2}\right)!}$$


50. Show by a change of variables that

$$\Gamma(x) = 2 \int_0^{\infty} t^{2x-1} e^{-t^2}\,dt = \int_{-\infty}^{\infty} e^{xt} e^{-e^t}\,dt$$


51. Show that the normal density integrates to 1. (Hint: First make a change of
variables to reduce the integral to that for the standard normal. The problem is
then to show that $\int_{-\infty}^{\infty} \exp(-x^2/2)\,dx = \sqrt{2\pi}$. Square both sides and reexpress
the problem as that of showing

$$\left(\int_{-\infty}^{\infty} \exp(-x^2/2)\,dx\right)\left(\int_{-\infty}^{\infty} \exp(-y^2/2)\,dy\right) = 2\pi$$

Finally, write the product of integrals as a double integral and change to polar
coordinates.)


52. Suppose that in a certain population, individuals' heights are approximately
normally distributed with parameters $\mu = 70$ and $\sigma = 3$ in.

a. What proportion of the population is over 6 ft. tall?
b. What is the distribution of heights if they are expressed in centimeters? In
meters?


53. Let X be a normal random variable with $\mu = 5$ and $\sigma = 10$. Find (a) P(X > 10),
(b) P(-20 < X < 15), and (c) the value of x such that P(X > x) = .05.


54. If $X \sim N(\mu, \sigma^2)$, show that $P(|X - \mu| \le .675\sigma) = .5$.


55. If $X \sim N(\mu, \sigma^2)$, find the value of c in terms of $\sigma$ such that
$P(\mu - c \le X \le \mu + c) = .95$.


56. If $X \sim N(0, \sigma^2)$, find the density of Y = |X|.


57. If $X \sim N(\mu, \sigma^2)$ and Y = aX + b, where a < 0, show that $Y \sim N(a\mu + b, a^2\sigma^2)$.


58. If U is uniform on [0, 1], find the density function of $\sqrt{U}$.


59. If U is uniform on [-1, 1], find the density function of $U^2$.


60. Find the density function of $Y = e^Z$, where $Z \sim N(\mu, \sigma^2)$. This is called the
lognormal density, since log Y is normally distributed.


61. Find the density of cX when X follows a gamma distribution. Show that only $\lambda$
is affected by such a transformation, which justifies calling $\lambda$ a scale parameter.


62. Show that if X has a density function $f_X$ and Y = aX + b, then

$$f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y-b}{a}\right)$$


63. Suppose that $\Theta$ follows a uniform distribution on the interval $[-\pi/2, \pi/2]$. Find
the cdf and density of $\tan \Theta$.


64. A particle of mass m has a random velocity, V, which is normally distributed
with parameters $\mu = 0$ and $\sigma$. Find the density function of the kinetic energy,
$E = \frac{1}{2} m V^2$.


65. How could random variables with the following density function be generated
from a uniform random number generator?

$$f(x) = \frac{1 + \alpha x}{2}, \quad -1 \le x \le 1, \quad -1 \le \alpha \le 1$$


66. Let $f(x) = \alpha x^{-\alpha-1}$ for $x \ge 1$ and f(x) = 0 otherwise, where $\alpha$ is a positive
parameter. Show how to generate random variables from this density from a
uniform random number generator.


67. The Weibull cumulative distribution function is

$$F(x) = 1 - e^{-(x/\alpha)^{\beta}}, \quad x \ge 0, \quad \alpha > 0, \quad \beta > 0$$

a. Find the density function.
b. Show that if W follows a Weibull distribution, then $X = (W/\alpha)^{\beta}$ follows an
exponential distribution.
c. How could Weibull random variables be generated from a uniform random
number generator?


68. If the radius of a circle is an exponential random variable, find the density function
of the area.


69. If the radius of a sphere is an exponential random variable, find the density
function of the volume.


70. Let U be a uniform random variable. Find the density function of $V = U^{-\alpha}$,
$\alpha > 0$. Compare the rates of decrease of the tails of the densities as a function of
$\alpha$. Does the comparison make sense intuitively?


71. This problem shows one way to generate discrete random variables from a
uniform random number generator. Suppose that F is the cdf of an integer-valued
random variable; let U be uniform on [0, 1]. Define a random variable Y = k if
$F(k - 1) < U \le F(k)$. Show that Y has cdf F. Apply this result to show how
to generate geometric random variables from uniform random variables.


72. One of the most commonly used (but not one of the best) methods of
generating pseudorandom numbers is the linear congruential method, which works
as follows. Let $x_0$ be an initial number (the "seed"). The sequence is generated
recursively as

$$x_{n+1} = (a x_n + c) \bmod m$$

a. Choose values of a, c, and m, and try this out. Do the sequences “look”
random?
b. Making good choices of a, c, and m involves both art and theory. The
following are some values that have been proposed: (1) a = 69069, c = 0, $m = 2^{32}$.
(2) a = 65539, c = 0, $m = 2^{31}$. The latter is an infamous generator called
RANDU. Try out these schemes, and examine the results.
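
As a starting point for Problem 72, the recursion can be coded directly. This is a bare sketch rather than a recommended generator: the function name `lcg_sequence` and the seed value 1 are arbitrary, and dividing by m to obtain numbers in [0, 1) is one common convention that the problem statement does not spell out.

```python
def lcg_sequence(seed, a, c, m, n):
    """Return the first n terms x_1, ..., x_n of x_{k+1} = (a * x_k + c) mod m."""
    x = seed
    terms = []
    for _ in range(n):
        x = (a * x + c) % m
        terms.append(x)
    return terms


# Scheme (2) from part (b): RANDU.
randu = lcg_sequence(seed=1, a=65539, c=0, m=2**31, n=5)
# Scale the integers to [0, 1) to use them as uniform pseudorandom numbers.
randu_uniform = [x / 2**31 for x in randu]

# Scheme (1) from part (b).
scheme_one = lcg_sequence(seed=1, a=69069, c=0, m=2**32, n=5)

print(randu)
print(randu_uniform)
print(scheme_one)
```

One thing to look for when examining the RANDU output: because $65539 = 2^{16} + 3$, consecutive terms satisfy $x_{k+2} = 6x_{k+1} - 9x_k \pmod{2^{31}}$, so successive triples are far from independent.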
Joint Distributions


Introduction 


This chapter is concerned with the joint probability structure of two or more random 
variables defined on the same sample space. Joint distributions arise naturally in many 
applications, of which the following are illustrative: 


In ecological studies, counts of several species, modeled as random variables, are 
often made. One species is often the prey of another; clearly, the number of predators 
will be related to the number of prey. 

The joint probability distribution of the x, y, and z components of wind velocity 
can be experimentally measured in studies of atmospheric turbulence. 

The joint distribution of the values of various physiological variables in a population 
of patients is often of interest in medical studies. 

A model for the joint distribution of age and length in a population of fish can be used 
to estimate the age distribution from the length distribution. The age distribution is 
relevant to the setting of reasonable harvesting policies. 


The joint behavior of two random variables, X and Y, is determined by the 
cumulative distribution function 


$$F(x, y) = P(X \le x, Y \le y)$$


regardless of whether X and Y are continuous or discrete. The cdf gives the probability 
that the point (X, Y) belongs to a semi-infinite rectangle in the plane, as shown in 
Figure 3.1. The probability that (X, Y) belongs to a given rectangle is, from Figure 3.2, 


$$P(x_1 < X \le x_2, y_1 < Y \le y_2) = F(x_2, y_2) - F(x_2, y_1) - F(x_1, y_2) + F(x_1, y_1)$$

FIGURE 3.1 F(a, b) gives the probability of the shaded rectangle.

FIGURE 3.2 The probability of the shaded rectangle can be found by subtracting from the
probability of the (semi-infinite) rectangle having the upper-right corner $(x_2, y_2)$ the
probabilities of the $(x_1, y_2)$ and $(x_2, y_1)$ rectangles, and then adding back in the
probability of the $(x_1, y_1)$ rectangle.


The probability that (X, Y) belongs to a set A, for a large enough class of sets
for practical purposes, can be determined by taking limits of intersections and unions
of rectangles. In general, if $X_1, \ldots, X_n$ are jointly distributed random variables, their
joint cdf is

$$F(x_1, x_2, \ldots, x_n) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n)$$

Two- and higher-dimensional versions of density functions and frequency
functions exist. We will start with a detailed description of such functions for the discrete
case, since it is the easier one to understand.


3.2 Discrete Random Variables


Suppose that X and Y are discrete random variables defined on the same sample
space and that they take on values $x_1, x_2, \ldots$, and $y_1, y_2, \ldots$, respectively. Their
joint frequency function, or joint probability mass function, p(x, y), is

$$p(x_i, y_j) = P(X = x_i, Y = y_j)$$

A simple example will illustrate this concept. A fair coin is tossed three times; let X
denote the number of heads on the first toss and Y the total number of heads. From
the sample space, which is

$$\Omega = \{hhh, hht, hth, htt, thh, tht, tth, ttt\}$$

we see that the joint frequency function of X and Y is as given in the following table:

              y
x      0      1      2      3
0     1/8    2/8    1/8     0
1      0     1/8    2/8    1/8

Thus, for example, p(0, 2) = P(X = 0, Y = 2) = 1/8. Note that the probabilities
in the preceding table sum to 1.

Suppose that we wish to find the frequency function of Y from the joint frequency 
function. This is straightforward: 


$$p_Y(0) = P(Y = 0) = P(Y = 0, X = 0) + P(Y = 0, X = 1) = \frac{1}{8} + 0 = \frac{1}{8}$$

$$p_Y(1) = P(Y = 1) = P(Y = 1, X = 0) + P(Y = 1, X = 1) = \frac{3}{8}$$

In general, to find the frequency function of Y, we simply sum down the appropriate
column of the table. For this reason, $p_Y$ is called the marginal frequency function
of Y. Similarly, summing across the rows gives

$$p_X(x) = \sum_i p(x, y_i)$$

which is the marginal frequency function of X.
The case for several random variables is analogous. If $X_1, \ldots, X_m$ are discrete
random variables defined on the same sample space, their joint frequency function is

$$p(x_1, \ldots, x_m) = P(X_1 = x_1, \ldots, X_m = x_m)$$

The marginal frequency function of $X_1$, for example, is

$$p_{X_1}(x_1) = \sum_{x_2 \cdots x_m} p(x_1, x_2, \ldots, x_m)$$

The two-dimensional marginal frequency function of $X_1$ and $X_2$, for example, is

$$p_{X_1 X_2}(x_1, x_2) = \sum_{x_3 \cdots x_m} p(x_1, x_2, \ldots, x_m)$$
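
The coin-tossing example is small enough to enumerate directly. The sketch below builds the joint frequency function from the eight equally likely outcomes and then sums it to get both marginals; the variable names are illustrative, and exact fractions are used so the results can be compared with the table entry by entry.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space for three tosses of a fair coin; each outcome has probability 1/8.
outcomes = list(product("ht", repeat=3))

joint = defaultdict(Fraction)  # p(x, y), with missing entries defaulting to 0
for w in outcomes:
    x = 1 if w[0] == "h" else 0   # number of heads on the first toss
    y = w.count("h")              # total number of heads
    joint[(x, y)] += Fraction(1, len(outcomes))

# Marginals: sum the joint frequency function over the other variable.
marginal_x = defaultdict(Fraction)
marginal_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    marginal_x[x] += p
    marginal_y[y] += p

print(dict(joint))
print(dict(marginal_x))
print(dict(marginal_y))
```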


Multinomial Distribution 
The multinomial distribution, an important generalization of the binomial distribution,
arises in the following way. Suppose that each of n independent trials can result in
one of r types of outcomes and that on each trial the probabilities of the r outcomes
are $p_1, p_2, \ldots, p_r$. Let $N_i$ be the total number of outcomes of type i in the n trials,
i = 1, ..., r. To calculate the joint frequency function, we observe that any particular
sequence of trials giving rise to $N_1 = n_1, N_2 = n_2, \ldots, N_r = n_r$ occurs with
probability $p_1^{n_1} p_2^{n_2} \cdots p_r^{n_r}$. From Proposition C in Section 1.4.2, we know that there
are $n!/(n_1! n_2! \cdots n_r!)$ such sequences, and thus the joint frequency function is

$$p(n_1, \ldots, n_r) = \binom{n}{n_1 \cdots n_r} p_1^{n_1} p_2^{n_2} \cdots p_r^{n_r}$$

The marginal distribution of any particular $N_i$ can be obtained by summing the joint
frequency function over the other $n_j$. This formidable algebraic task can be avoided,
however, by noting that $N_i$ can be interpreted as the number of successes in n trials,
each of which has probability $p_i$ of success and $1 - p_i$ of failure. Therefore, $N_i$ is a
binomial random variable, and

$$p_{N_i}(n_i) = \binom{n}{n_i} p_i^{n_i} (1 - p_i)^{n - n_i}$$

The multinomial distribution is applicable in considering the probabilistic
properties of a histogram. As a concrete example, suppose that 100 independent
observations are taken from a uniform distribution on [0, 1], that the interval [0, 1] is
partitioned into 10 equal bins, and that the counts $n_1, \ldots, n_{10}$ in each of the 10 bins are
recorded and graphed as the heights of vertical bars above the respective bins. The joint
distribution of the heights is multinomial with n = 100 and $p_i = .1$, i = 1, ..., 10.
Figure 3.3 shows four histograms constructed in this manner from a pseudorandom
number generator; the figure illustrates the sort of random fluctuations that can be
expected in histograms.

FIGURE 3.3 Four histograms, each formed from 100 independent uniform random
numbers.
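
The claim that each $N_i$ is marginally binomial can be checked by brute force in a small case. In this sketch the parameters (n = 6 trials, r = 3 outcome types with probabilities .2, .3, .5) are made up for illustration, as are the function names.

```python
from math import comb, factorial, prod


def multinomial_pmf(counts, probs):
    """Joint frequency function: n!/(n_1! ... n_r!) * p_1^n_1 ... p_r^n_r."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)  # stays an integer at every step
    return coef * prod(p ** c for p, c in zip(probs, counts))


def binomial_pmf(k, n, p):
    """Binomial frequency function with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n = 6
probs = (0.2, 0.3, 0.5)


def marginal_n1(n1):
    """Sum the joint frequency function over the other counts (n2, with n3 determined)."""
    return sum(
        multinomial_pmf((n1, n2, n - n1 - n2), probs) for n2 in range(n - n1 + 1)
    )


largest_gap = max(
    abs(marginal_n1(k) - binomial_pmf(k, n, probs[0])) for k in range(n + 1)
)
print(largest_gap)  # should be zero up to floating-point rounding
```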


Continuous Random Variables 


Suppose that X and Y are continuous random variables with a joint cdf, F(x, y). Their
joint density function is a piecewise continuous function of two variables, f(x, y).
The density function f(x, y) is nonnegative and $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1$. For
any "reasonable" two-dimensional set A,

$$P((X, Y) \in A) = \iint_A f(x, y)\,dy\,dx$$

In particular, if $A = \{(X, Y) \mid X \le x \text{ and } Y \le y\}$,

$$F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v)\,dv\,du$$

From the fundamental theorem of multivariable calculus, it follows that

$$f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y)$$

wherever the derivative is defined.
For small $\delta_x$ and $\delta_y$, if f is continuous at (x, y),

$$P(x \le X \le x + \delta_x, y \le Y \le y + \delta_y) = \int_x^{x+\delta_x} \int_y^{y+\delta_y} f(u, v)\,dv\,du \approx f(x, y)\,\delta_x \delta_y$$

Thus, the probability that (X, Y) is in a small neighborhood of (x, y) is proportional
to f(x, y). Differential notation is sometimes useful:

$$P(x \le X \le x + dx, y \le Y \le y + dy) = f(x, y)\,dx\,dy$$


EXAMPLE A Consider the bivariate density function

$$f(x, y) = \frac{12}{7}(x^2 + xy), \quad 0 \le x \le 1, \quad 0 \le y \le 1$$

which is plotted in Figure 3.4. P(X > Y) can be found by integrating f over the set
$\{(x, y) \mid 0 \le y \le x \le 1\}$:

$$P(X > Y) = \frac{12}{7} \int_0^1 \int_0^x (x^2 + xy)\,dy\,dx = \frac{9}{14}$$

FIGURE 3.4 The density function $f(x, y) = \frac{12}{7}(x^2 + xy)$, $0 \le x \le 1$, $0 \le y \le 1$.


The marginal cdf of X, or $F_X$, is

$$F_X(x) = P(X \le x) = \lim_{y \to \infty} F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, y)\,dy\,du$$

From this, it follows that the density function of X alone, known as the marginal
density of X, is

$$f_X(x) = F_X'(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$$

In the discrete case, the marginal frequency function was found by summing the joint
frequency function over the other variable; in the continuous case, it is found by
integration.


EXAMPLE B Continuing Example A, the marginal density of X is

$$f_X(x) = \frac{12}{7} \int_0^1 (x^2 + xy)\,dy = \frac{12}{7}\left(x^2 + \frac{x}{2}\right)$$

A similar calculation shows that the marginal density of Y is
$f_Y(y) = \frac{12}{7}\left(\frac{1}{3} + \frac{y}{2}\right)$.
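
The integrals in Examples A and B can be double-checked numerically. The sketch below uses a plain midpoint rule; the helper `integrate`, the grid sizes, and the evaluation point x = 0.5 are arbitrary illustrative choices.

```python
def integrate(func, a, b, n=200):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (k + 0.5) * h) for k in range(n))


def joint_density(x, y):
    """f(x, y) = (12/7)(x^2 + xy) on the unit square (the caller stays inside it)."""
    return 12.0 / 7.0 * (x * x + x * y)


def marginal_x(x):
    """f_X(x): integrate the joint density over y in [0, 1]."""
    return integrate(lambda y: joint_density(x, y), 0.0, 1.0)


# P(X > Y): integrate over the triangle 0 <= y <= x <= 1.
p_x_gt_y = integrate(
    lambda x: integrate(lambda y: joint_density(x, y), 0.0, x),
    0.0,
    1.0,
)

print(p_x_gt_y, 9 / 14)
print(marginal_x(0.5), 12 / 7 * (0.5 ** 2 + 0.5 / 2))
```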


For several jointly continuous random variables, we can make the obvious
generalizations. The joint density function is a function of several variables, and the
marginal density functions are found by integration. There are marginal density
functions of various dimensions. Suppose that X, Y, and Z are jointly continuous
random variables with density function f(x, y, z). The one-dimensional marginal
distribution of X is

$$f_X(x) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y, z)\,dy\,dz$$

and the two-dimensional marginal distribution of X and Y is

$$f_{XY}(x, y) = \int_{-\infty}^{\infty} f(x, y, z)\,dz$$


EXAMPLE C Farlie-Morgenstern Family
If F(x) and G(y) are one-dimensional cdfs, it can be shown that, for any $\alpha$ for which
$|\alpha| \le 1$,

$$H(x, y) = F(x)G(y)\{1 + \alpha[1 - F(x)][1 - G(y)]\}$$

is a bivariate cumulative distribution function. Because $\lim_{x \to \infty} F(x) = \lim_{y \to \infty} G(y) = 1$,
the marginal distributions are

$$H(x, \infty) = F(x)$$
$$H(\infty, y) = G(y)$$

In this way, an infinite number of different bivariate distributions with given marginals
can be constructed.

As an example, we will construct bivariate distributions with marginals that are
uniform on [0, 1] [F(x) = x, $0 \le x \le 1$, and G(y) = y, $0 \le y \le 1$]. First, with
$\alpha = -1$, we have

$$H(x, y) = xy[1 - (1 - x)(1 - y)] = x^2 y + y^2 x - x^2 y^2, \quad 0 \le x \le 1, \quad 0 \le y \le 1$$

The bivariate density is

$$h(x, y) = \frac{\partial^2}{\partial x\,\partial y} H(x, y) = 2x + 2y - 4xy, \quad 0 \le x \le 1, \quad 0 \le y \le 1$$

The density is shown in Figure 3.5. Perhaps you can imagine integrating over y
(pushing all the mass onto the x axis) to produce a marginal uniform density for x.
Next, if $\alpha = 1$,

$$H(x, y) = xy[1 + (1 - x)(1 - y)] = 2xy - x^2 y - y^2 x + x^2 y^2, \quad 0 \le x \le 1, \quad 0 \le y \le 1$$

The density is

$$h(x, y) = 2 - 2x - 2y + 4xy, \quad 0 \le x \le 1, \quad 0 \le y \le 1$$

This density is shown in Figure 3.6.
We just constructed two different bivariate distributions, both of which have
uniform marginals.

FIGURE 3.5 The joint density h(x, y) = 2x + 2y - 4xy, where $0 \le x \le 1$ and
$0 \le y \le 1$, which has uniform marginal densities.

FIGURE 3.6 The joint density h(x, y) = 2 - 2x - 2y + 4xy, where $0 \le x \le 1$ and
$0 \le y \le 1$, which has uniform marginal densities.
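
Both densities in Example C are instances of $h(x, y) = 1 + \alpha(1 - 2x)(1 - 2y)$, which is what differentiating $H(x, y) = xy\{1 + \alpha(1 - x)(1 - y)\}$ with respect to x and y gives for uniform marginals. The sketch below checks numerically that integrating out y leaves a uniform density for x; the function names and the test points are illustrative.

```python
def fm_density(x, y, alpha):
    """Mixed partial derivative of H(x, y) = xy{1 + alpha(1 - x)(1 - y)} on the unit square."""
    return 1.0 + alpha * (1.0 - 2.0 * x) * (1.0 - 2.0 * y)


def integrate_unit_interval(func, n=1000):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    h = 1.0 / n
    return h * sum(func((k + 0.5) * h) for k in range(n))


def marginal_of_x(x, alpha):
    """Integrate the joint density over y to obtain the marginal density at x."""
    return integrate_unit_interval(lambda y: fm_density(x, y, alpha))


for alpha in (-1.0, 1.0):
    values = [marginal_of_x(x, alpha) for x in (0.1, 0.3, 0.5, 0.9)]
    print(alpha, values)  # each value should be 1, the uniform density

# Spot check against the expanded forms in the text.
print(fm_density(0.3, 0.7, -1.0), 2 * 0.3 + 2 * 0.7 - 4 * 0.3 * 0.7)
print(fm_density(0.3, 0.7, 1.0), 2 - 2 * 0.3 - 2 * 0.7 + 4 * 0.3 * 0.7)
```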


A copula is a joint cumulative distribution function of random variables that have 
uniform marginal distributions. The functions H (x, y) in the preceding example are 
copulas. Note that a copula C(u, v) is nondecreasing in each variable, because it 
is a cdf. Also, $P(U \le u) = C(u, 1) = u$ and $C(1, v) = v$, since the marginal
distributions are uniform. We will restrict ourselves to copulas that have densities, in
which case the density is

$$c(u, v) = \frac{\partial^2}{\partial u\,\partial v} C(u, v) \ge 0$$

Now, suppose that X and Y are continuous random variables with cdfs $F_X(x)$
and $F_Y(y)$. Then $U = F_X(X)$ and $V = F_Y(Y)$ are uniform random variables (Proposition 2.3C).
For a copula $C(u, v)$, consider the joint distribution defined by


$$F_{XY}(x, y) = C(F_X(x), F_Y(y))$$

Since $C(F_X(x), 1) = F_X(x)$, the marginal cdfs corresponding to $F_{XY}$ are $F_X(x)$ and
$F_Y(y)$. Using the chain rule, the corresponding density is

$$f_{XY}(x, y) = c(F_X(x), F_Y(y))\, f_X(x) f_Y(y)$$


This construction points out that from the ingredients of two marginal distributions 
and any copula, a joint distribution with those marginals can be constructed. It is 
thus clear that the marginal distributions do not determine the joint distribution. The 
dependence between the random variables is captured in the copula. Copulas are not
just academic curiosities; they have been extensively used in financial statistics in
recent years to model dependencies in the returns of financial instruments.


Consider the following joint density:

$$f_{XY}(x, y) = \begin{cases} \lambda^2 e^{-\lambda y}, & 0 \le x \le y, \ \lambda > 0 \\ 0, & \text{elsewhere} \end{cases}$$


This joint density is plotted in Figure 3.7. To find the marginal densities, it is helpful 
to draw a picture showing where the density is nonzero to aid in determining the limits 
of integration (see Figure 3.8). 





FIGURE 3.7 The joint density of Example D.

FIGURE 3.8 The joint density of Example D is nonzero over the shaded region of
the plane.


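First consider the marginal density $f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy$. Since $f_{XY}(x, y) = 0$
for $x > y$,

$$f_X(x) = \int_x^{\infty} \lambda^2 e^{-\lambda y}\,dy = \lambda e^{-\lambda x}, \qquad x \ge 0$$

and we see that the marginal distribution of X is exponential. Next, because
$f_{XY}(x, y) = 0$ for $x < 0$ and $x > y$,

$$f_Y(y) = \int_0^y \lambda^2 e^{-\lambda y}\,dx = \lambda^2 y e^{-\lambda y}, \qquad y \ge 0$$

The marginal distribution of Y is a gamma distribution. ■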


In some applications, it is useful to analyze distributions that are uniform over 
some region of space. For example, in the plane, the random point (X, Y) is uniform 
over a region, R, if for any $A \subset R$,

$$P((X, Y) \in A) = \frac{|A|}{|R|}$$


where | | denotes area. 


EXAMPLE E A point is chosen randomly in a disk of radius 1. Since the area of the disk is $\pi$,

$$f(x, y) = \begin{cases} \dfrac{1}{\pi}, & \text{if } x^2 + y^2 \le 1 \\ 0, & \text{otherwise} \end{cases}$$

We can calculate the distribution of R, the distance of the point from the origin. $R \le r$
if the point lies in a disk of radius r. Since this disk has area $\pi r^2$,

$$F_R(r) = P(R \le r) = \frac{\pi r^2}{\pi} = r^2$$

The density function of R is thus $f_R(r) = 2r$, $0 \le r \le 1$.
Let us now find the marginal density of the x coordinate of the random point: 


$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \frac{1}{\pi} \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} dy = \frac{2}{\pi} \sqrt{1 - x^2}, \qquad -1 \le x \le 1$$

Note that we chose the limits of integration carefully; outside these limits the joint
density is zero. (Draw a picture of the region over which $f(x, y) > 0$ and indicate
the preceding limits of integration.) By symmetry, the marginal density of Y is

$$f_Y(y) = \frac{2}{\pi} \sqrt{1 - y^2}, \qquad -1 \le y \le 1$$

■
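
The result $F_R(r) = r^2$ of Example E can be checked by simulation. This is only an illustrative sketch: the helper names `sample_disk` and `empirical_cdf_of_distance` are hypothetical, and drawing the point by rejection from the enclosing square is one convenient choice, not something prescribed by the text.

```python
import math
import random


def sample_disk(rng):
    """Draw a point uniformly from the unit disk by rejection from the square [-1, 1]^2."""
    while True:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return x, y


def empirical_cdf_of_distance(r, n=50_000, seed=0):
    """Fraction of simulated points whose distance from the origin is at most r."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = sample_disk(rng)
        if math.hypot(x, y) <= r:
            hits += 1
    return hits / n


if __name__ == "__main__":
    for r in (0.25, 0.5, 0.75):
        # The empirical fraction should be close to r squared.
        print(r, empirical_cdf_of_distance(r), r ** 2)
```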


Bivariate Normal Density 
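The bivariate normal density is given by the complicated expression

$$f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} \right] \right)$$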


One of the earliest uses of this bivariate density was as a model for the joint distribution 
of the heights of fathers and sons. The density depends on five parameters: 


$$-\infty < \mu_X < \infty \qquad -\infty < \mu_Y < \infty$$
$$\sigma_X > 0 \qquad \sigma_Y > 0$$
$$-1 < \rho < 1$$


The contour lines of the density are the lines in the xy plane on which the joint density 
is constant. From the preceding equation, we see that f (x, y) is constant if 


$$\frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} = \text{constant}$$

The locus of such points is an ellipse centered at $(\mu_X, \mu_Y)$. If $\rho = 0$, the axes of the
ellipse are parallel to the x and y axes, and if $\rho \ne 0$, they are tilted. Figure 3.9 shows
several bivariate normal densities, and Figure 3.10 shows the corresponding elliptical
contours.

FIGURE 3.9 Bivariate normal densities with $\mu_X = \mu_Y = 0$ and $\sigma_X = \sigma_Y = 1$ and (a) $\rho = 0$, (b) $\rho = .3$,
(c) $\rho = .6$, (d) $\rho = .9$.


The marginal distributions of X and Y are $N(\mu_X, \sigma_X^2)$ and $N(\mu_Y, \sigma_Y^2)$, respectively,
as we will now demonstrate. The marginal density of X is

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy$$

Making the changes of variables $u = (x - \mu_X)/\sigma_X$ and $v = (y - \mu_Y)/\sigma_Y$ gives us

$$f_X(x) = \frac{1}{2\pi \sigma_X \sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} \exp\left[ -\frac{1}{2(1 - \rho^2)} (u^2 + v^2 - 2\rho uv) \right] dv$$

FIGURE 3.10 The elliptical contours of the bivariate normal densities of Figure 3.9.


To evaluate this integral, we use the technique of completing the square. Using the
identity

$$u^2 + v^2 - 2\rho uv = (v - \rho u)^2 + u^2 (1 - \rho^2)$$

we have

$$f_X(x) = \frac{1}{2\pi \sigma_X \sqrt{1 - \rho^2}}\, e^{-u^2/2} \int_{-\infty}^{\infty} \exp\left[ -\frac{1}{2(1 - \rho^2)} (v - \rho u)^2 \right] dv$$

Finally, recognizing the integrand as proportional to a normal density with mean $\rho u$ and variance
$(1 - \rho^2)$, so that the integral equals $\sqrt{2\pi (1 - \rho^2)}$, we obtain

$$f_X(x) = \frac{1}{\sigma_X \sqrt{2\pi}}\, e^{-(1/2)\left[(x - \mu_X)^2 / \sigma_X^2\right]}$$

which is a normal density, as was to be shown. Thus, for example, the marginal
distributions of x and y in Figure 3.9 are all standard normal, even though the joint
distributions of (a)-(d) are quite different from each other. ■


We saw in our discussion of copulas earlier in this section that marginal densities
do not determine joint densities. For example, we can take both marginal densities
to be normal with parameters $\mu = 0$ and $\sigma = 1$ and use the Farlie-Morgenstern
copula with density $c(u, v) = 2 - 2u - 2v + 4uv$. Denoting the normal density and
cumulative distribution functions by $\phi(x)$ and $\Phi(x)$, the bivariate density is

$$f(x, y) = [2 - 2\Phi(x) - 2\Phi(y) + 4\Phi(x)\Phi(y)]\, \phi(x) \phi(y)$$

FIGURE 3.11 A bivariate density that has normal marginals but is not bivariate
normal. The contours of the density shown in the xy plane are not elliptical.


This density and its contours are shown in Figure 3.11. Note that the contours are not 
elliptical. This bivariate density has normal marginals, but it is not a bivariate normal 
density. 


Independent Random Variables 


DEFINITION 


Random variables $X_1, X_2, \ldots, X_n$ are said to be independent if their joint cdf
factors into the product of their marginal cdf’s:

$$F(x_1, x_2, \ldots, x_n) = F_{X_1}(x_1) F_{X_2}(x_2) \cdots F_{X_n}(x_n)$$

for all $x_1, x_2, \ldots, x_n$. ■


The definition holds for both continuous and discrete random variables. For 
discrete random variables, it is equivalent to state that their joint frequency function 
factors; for continuous random variables, it is equivalent to state that their joint density 
function factors. To see why this is true, consider the case of two jointly continuous 
random variables, X and Y. If they are independent, then 


$$F(x, y) = F_X(x) F_Y(y)$$

and taking the second mixed partial derivative makes it clear that the density function
factors. On the other hand, if the density function factors, then the joint cdf can be
expressed as a product:

$$F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_X(u) f_Y(v)\,dv\,du = \left[ \int_{-\infty}^{x} f_X(u)\,du \right] \left[ \int_{-\infty}^{y} f_Y(v)\,dv \right] = F_X(x) F_Y(y)$$

It can be shown that the definition implies that if X and Y are independent, then

$$P(X \in A, Y \in B) = P(X \in A) P(Y \in B)$$

It can also be shown that if g and h are functions, then $Z = g(X)$ and $W = h(Y)$ are
independent as well. A sketch of an argument goes like this (the details are beyond
the level of this course): We wish to find $P(Z \le z, W \le w)$. Let $A(z)$ be the set of
x such that $g(x) \le z$, and let $B(w)$ be the set of y such that $h(y) \le w$. Then

$$P(Z \le z, W \le w) = P(X \in A(z), Y \in B(w)) = P(X \in A(z)) P(Y \in B(w)) = P(Z \le z) P(W \le w)$$

EXAMPLE A Suppose that the point (X, Y) is uniformly distributed on the square
$S = \{(x, y) \mid -1/2 \le x \le 1/2, -1/2 \le y \le 1/2\}$: $f_{XY}(x, y) = 1$ for (x, y) in S and 0 elsewhere.
Make a sketch of this square. You can visualize that the marginal distributions of X
and Y are uniform on [−1/2, 1/2]. For example, the marginal density at a point x,
$-1/2 \le x \le 1/2$, is found by integrating (summing) the joint density over the vertical
line that meets the horizontal axis at x. Thus, $f_X(x) = 1$, $-1/2 \le x \le 1/2$, and
$f_Y(y) = 1$, $-1/2 \le y \le 1/2$. The joint density is equal to the product of the
marginal densities, so X and Y are independent. You should be able to see from your
sketch that knowing the value of X gives no information about the possible values
of Y. ■

EXAMPLE B Now consider rotating the square of the previous example by 45° to form a diamond.
Sketch this diamond. From the sketch, you can see that the marginal density of X is
positive for $-\sqrt{2}/2 < x < \sqrt{2}/2$, but it is not uniform, and similarly for
the marginal density of Y. Thus, for example, $f_X(.6) > 0$ and $f_Y(.6) > 0$. But from
the sketch you can also see that $f_{XY}(.6, .6) = 0$. Thus, X and Y are not independent.
Finally, the sketch shows you that knowing the value of X (for example, X = .6)
constrains the possible values of Y. ■

Farlie-Morgenstern Family

From Example C in Section 3.3, we see that X and Y are independent only if $\alpha = 0$,
since only in this case does the joint cdf H factor into the product of the marginals F
and G. ■

If X and Y follow a bivariate normal distribution (Example F from Section 3.3)
and $\rho = 0$, their joint density factors into the product of two normal densities, and
therefore X and Y are independent. ■
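
The contrast between the square and the diamond can be seen in a small simulation. The sketch below is illustrative only: the function name `joint_vs_product` and the thresholds 0.25 and 0.4 are arbitrary choices, and the diamond is obtained by rotating the unit square through 45°, which is the rotation that turns the square into a diamond. For independent coordinates the joint probability $P(X > a, Y > a)$ should match the product $P(X > a)P(Y > a)$.

```python
import math
import random


def joint_vs_product(points, a):
    """Return (P(X > a, Y > a), P(X > a) * P(Y > a)) estimated from sample points."""
    n = len(points)
    px = sum(1 for x, _ in points if x > a) / n
    py = sum(1 for _, y in points if y > a) / n
    pxy = sum(1 for x, y in points if x > a and y > a) / n
    return pxy, px * py


rng = random.Random(0)
# Uniform points on the square [-1/2, 1/2] x [-1/2, 1/2].
square = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(50_000)]
# Rotate every point by 45 degrees to obtain uniform points on the diamond.
c = math.cos(math.pi / 4)
diamond = [(c * (x - y), c * (x + y)) for x, y in square]

if __name__ == "__main__":
    print("square :", joint_vs_product(square, 0.25))   # the two numbers nearly agree
    print("diamond:", joint_vs_product(diamond, 0.4))   # joint probability is 0, product is not
```

On the diamond, both coordinates cannot exceed 0.4 at once, so the joint probability is exactly zero while each marginal probability is positive.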


Suppose that a node in a communications network has the property that if two packets
of information arrive within time $\tau$ of each other, they “collide” and then have to be
retransmitted. If the times of arrival of the two packets are independent and uniform
on [0, T], what is the probability that they collide?

The times of arrival of two packets, $T_1$ and $T_2$, are independent and uniform on
[0, T], so their joint density is the product of the marginals, or

$$f(t_1, t_2) = \frac{1}{T^2}$$

for $t_1$ and $t_2$ in the square with sides [0, T]. Therefore, $(T_1, T_2)$ is uniformly distributed
over the square. The probability that the two packets collide is proportional to the
area of the shaded strip in Figure 3.12. Each of the unshaded triangles of the figure
has area $(T - \tau)^2/2$, and thus the area of the shaded area is $T^2 - (T - \tau)^2$. Integrating
$f(t_1, t_2)$ over this area gives the desired probability: $1 - (1 - \tau/T)^2$. ■
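
As a check on the area argument, the collision probability can be estimated by simulation and compared with $1 - (1 - \tau/T)^2$. This is a minimal sketch; the function names and the values $\tau = 1$, $T = 4$ are illustrative assumptions.

```python
import random


def collision_probability(tau, T):
    """Closed form from the example: 1 - (1 - tau/T)^2, for 0 <= tau <= T."""
    return 1.0 - (1.0 - tau / T) ** 2


def simulate_collisions(tau, T, n=100_000, seed=0):
    """Fraction of simulated packet pairs whose arrival times differ by less than tau."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        t1 = rng.uniform(0.0, T)
        t2 = rng.uniform(0.0, T)
        if abs(t1 - t2) < tau:
            hits += 1
    return hits / n


if __name__ == "__main__":
    print(collision_probability(1.0, 4.0))  # 0.4375
    print(simulate_collisions(1.0, 4.0))    # should be close to 0.4375
```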


FIGURE 3.12 The probability that the two packets collide is proportional to the
area of the shaded region $|t_1 - t_2| < \tau$.

3.5 Conditional Distributions


The Discrete Case 


If X and Y are jointly distributed discrete random variables, the conditional probability
that $X = x_i$ given that $Y = y_j$ is, if $p_Y(y_j) > 0$,

$$P(X = x_i \mid Y = y_j) = \frac{P(X = x_i, Y = y_j)}{P(Y = y_j)} = \frac{p_{XY}(x_i, y_j)}{p_Y(y_j)}$$

This probability is defined to be zero if $p_Y(y_j) = 0$. We will denote this conditional
probability by $p_{X|Y}(x|y)$. Note that this function of x is a genuine frequency function
since it is nonnegative and sums to 1 and that $p_{Y|X}(y|x) = p_Y(y)$ if X and Y are
independent.





We return to the simple discrete distribution considered in Section 3.2, reproducing
the table of values for convenience here:

              y
  x      0     1     2     3
  0     1/8   2/8   1/8    0
  1      0    1/8   2/8   1/8

The conditional frequency function of X given Y = 1 is

$$p_{X|Y}(0|1) = \frac{2/8}{3/8} = \frac{2}{3}$$
$$p_{X|Y}(1|1) = \frac{1/8}{3/8} = \frac{1}{3}$$


The definition of the conditional frequency function just given can be reexpressed
as

$$p_{XY}(x, y) = p_{X|Y}(x|y)\, p_Y(y)$$

(the multiplication law of Chapter 1). This useful equation gives a relationship between
the joint and conditional frequency functions. Summing both sides over all values of
y, we have an extremely useful application of the law of total probability:

$$p_X(x) = \sum_y p_{X|Y}(x|y)\, p_Y(y)$$


EXAMPLE B Suppose that a particle counter is imperfect and independently detects each incoming
particle with probability p. If the distribution of the number of incoming particles in
a unit of time is a Poisson distribution with parameter $\lambda$, what is the distribution of
the number of counted particles?

Let N denote the true number of particles and X the counted number. From the
statement of the problem, the conditional distribution of X given N = n is binomial,
with n trials and probability p of success. By the law of total probability,

$$\begin{aligned}
P(X = k) &= \sum_{n=0}^{\infty} P(N = n) P(X = k \mid N = n) \\
&= \sum_{n=k}^{\infty} \frac{\lambda^n e^{-\lambda}}{n!} \binom{n}{k} p^k (1 - p)^{n-k} \\
&= \frac{(\lambda p)^k}{k!} e^{-\lambda} \sum_{n=k}^{\infty} \frac{\lambda^{n-k} (1 - p)^{n-k}}{(n - k)!} \\
&= \frac{(\lambda p)^k}{k!} e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j (1 - p)^j}{j!} \\
&= \frac{(\lambda p)^k}{k!} e^{-\lambda} e^{\lambda (1 - p)} \\
&= \frac{(\lambda p)^k}{k!} e^{-\lambda p}
\end{aligned}$$

We see that the distribution of X is a Poisson distribution with parameter $\lambda p$. This
model arises in other applications as well. For example, N might denote the number
of traffic accidents in a given time period, with each accident being fatal or nonfatal;
X would then be the number of fatal accidents. ■
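
The conclusion of Example B can be illustrated by simulating the imperfect counter directly. The sketch below is an assumption-laden illustration: the function names are hypothetical, the values $\lambda = 4$ and $p = 0.3$ are arbitrary, and the Poisson sampler uses Knuth's multiplication method, which is adequate for small $\lambda$ but is not taken from the text.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method: count uniforms until their product drops below e^(-lam)."""
    threshold = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(lam, p, n, seed=0):
    """Simulate n unit-time intervals; each incoming particle is detected with probability p."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        incoming = sample_poisson(lam, rng)
        counts.append(sum(1 for _ in range(incoming) if rng.random() < p))
    return counts


def poisson_pmf(k, mu):
    """Poisson frequency function with parameter mu."""
    return mu ** k * math.exp(-mu) / math.factorial(k)


if __name__ == "__main__":
    lam, p = 4.0, 0.3
    counts = simulate_counts(lam, p, 50_000)
    for k in range(5):
        # Empirical frequency of k counted particles next to the Poisson(lam * p) value.
        print(k, counts.count(k) / len(counts), round(poisson_pmf(k, lam * p), 4))
```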


3.5.2 The Continuous Case

In analogy with the definition in the preceding section, if X and Y are jointly continuous
random variables, the conditional density of Y given X is defined to be

$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}$$

if $0 < f_X(x) < \infty$, and 0 otherwise. This definition is in accord with the result to
which a differential argument would lead. We would define $f_{Y|X}(y|x)\,dy$ as
$P(y \le Y \le y + dy \mid x \le X \le x + dx)$ and calculate

$$P(y \le Y \le y + dy \mid x \le X \le x + dx) = \frac{f_{XY}(x, y)\,dx\,dy}{f_X(x)\,dx} = \frac{f_{XY}(x, y)}{f_X(x)}\,dy$$

Note that the rightmost expression is interpreted as a function of y, x being fixed.
The numerator is the joint density $f_{XY}(x, y)$, viewed as a function of y for fixed x:
you can visualize it as the curve formed by slicing through the joint density function
perpendicular to the x axis. The denominator normalizes that curve to have unit
area.

The joint density can be expressed in terms of the marginal and conditional
densities as follows:

$$f_{XY}(x, y) = f_{Y|X}(y|x)\, f_X(x)$$

Integrating both sides over x allows the marginal density of Y to be expressed as

$$f_Y(y) = \int_{-\infty}^{\infty} f_{Y|X}(y|x)\, f_X(x)\,dx$$

which is the law of total probability for the continuous case.


In Example D in Section 3.3, we saw that 


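$$f_{XY}(x, y) = \lambda^2 e^{-\lambda y}, \qquad 0 \le x \le y$$
$$f_X(x) = \lambda e^{-\lambda x}, \qquad x \ge 0$$
$$f_Y(y) = \lambda^2 y e^{-\lambda y}, \qquad y \ge 0$$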


Let us find the conditional densities. Before doing the formal calculations, it is informative
to examine the joint density for x and y, respectively, held constant. If x
is constant, the joint density decays exponentially in y for $y \ge x$; if y is constant,
the joint density is constant for $0 \le x \le y$. (See Figure 3.7.) Now let us find the
conditional densities according to the preceding definition. First,

$$f_{Y|X}(y|x) = \frac{\lambda^2 e^{-\lambda y}}{\lambda e^{-\lambda x}} = \lambda e^{-\lambda (y - x)}, \qquad y \ge x$$

The conditional density of Y given X = x is exponential on the interval $[x, \infty)$.
Expressing the joint density as

$$f_{XY}(x, y) = f_{Y|X}(y|x)\, f_X(x)$$

we see that we could generate X and Y according to $f_{XY}$ in the following way: First,
generate X as an exponential random variable ($f_X$), and then generate Y as another
exponential random variable ($f_{Y|X}$) on the interval $[x, \infty)$. From this representation,
we see that Y may be interpreted as the sum of two independent exponential random
variables and that the distribution of this sum is gamma, a fact that we will derive
later by a different method.
Now,

$$f_{X|Y}(x|y) = \frac{\lambda^2 e^{-\lambda y}}{\lambda^2 y e^{-\lambda y}} = \frac{1}{y}, \qquad 0 \le x \le y$$

The conditional density of X given Y = y is uniform on the interval [0, y]. Finally,
expressing the joint density as

$$f_{XY}(x, y) = f_{X|Y}(x|y)\, f_Y(y)$$

we see that alternatively we could generate X and Y according to the density $f_{XY}$ by
first generating Y from a gamma density and then generating X uniformly on [0, y].
Another interpretation of this result is that, conditional on the sum of two independent
exponential random variables, the first is uniformly distributed. ■
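
The two generation schemes described in this example can be sketched as follows. The function names are hypothetical and $\lambda = 2$ is an arbitrary choice. The code relies on two parameter conventions of Python's standard `random` module: `expovariate` takes the rate $\lambda$, while `gammavariate` takes a shape and a scale, so the scale is $1/\lambda$.

```python
import random


def sample_via_x_first(lam, rng):
    """Generate X from f_X (exponential), then Y from f_{Y|X}: x plus an independent exponential."""
    x = rng.expovariate(lam)
    y = x + rng.expovariate(lam)
    return x, y


def sample_via_y_first(lam, rng):
    """Generate Y from f_Y (gamma with shape 2 and rate lam), then X uniform on [0, y]."""
    y = rng.gammavariate(2.0, 1.0 / lam)
    x = rng.uniform(0.0, y)
    return x, y


def sample_means(sampler, lam, n=50_000, seed=0):
    """Return the sample means of X and Y over n draws from the given sampler."""
    rng = random.Random(seed)
    sx = sy = 0.0
    for _ in range(n):
        x, y = sampler(lam, rng)
        sx += x
        sy += y
    return sx / n, sy / n


if __name__ == "__main__":
    lam = 2.0
    # Both schemes should give E(X) near 1/lam and E(Y) near 2/lam.
    print(sample_means(sample_via_x_first, lam))
    print(sample_means(sample_via_y_first, lam))
```

Both samplers always return a pair with $0 \le x \le y$, the region on which the joint density is nonzero.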


Stereology 

In metallography and other applications of quantitative microscopy, aspects of a three- 
dimensional structure are deduced from studying two-dimensional cross sections. 
Concepts of probability and statistics play an important role (DeHoff and Rhines 
1968). In particular, the following problem arises. Spherical particles are dispersed 
in a medium (grains in a metal, for example); the density function of the radii of 
the spheres can be denoted as fr(r). When the medium is sliced, two-dimensional, 
circular cross sections of the spheres are observed; let the density function of the radii 
of these circles be denoted by fx (x). How are these density functions related? 








FIGURE 3.13 A plane slices a sphere of radius r at a distance H from its center, 
producing a circle of radius x. 


To derive the relationship, we assume that the cross-sectioning plane is chosen
at random, fix R = r, and find the conditional density $f_{X|R}(x|r)$. As shown in
Figure 3.13, let H denote the distance from the center of the sphere to the planar cross
section. By our assumption, H is uniformly distributed on [0, r], and $X = \sqrt{r^2 - H^2}$.
We can thus find the conditional distribution of X given R = r:

$$\begin{aligned}
F_{X|R}(x|r) &= P(X \le x) \\
&= P(\sqrt{r^2 - H^2} \le x) \\
&= P(H \ge \sqrt{r^2 - x^2}) \\
&= 1 - \frac{\sqrt{r^2 - x^2}}{r}, \qquad 0 \le x \le r
\end{aligned}$$

Differentiating, we find

$$f_{X|R}(x|r) = \frac{x}{r \sqrt{r^2 - x^2}}, \qquad 0 \le x \le r$$

The marginal density of X is, from the law of total probability,

$$f_X(x) = \int_{-\infty}^{\infty} f_{X|R}(x|r) f_R(r)\,dr = \int_x^{\infty} \frac{x}{r \sqrt{r^2 - x^2}} f_R(r)\,dr$$

[The limits of integration are x and $\infty$ since for $r \le x$, $f_{X|R}(x|r) = 0$.] This equation
is called Abel’s equation. In practice, the marginal density $f_X$ can be approximated
by making measurements of the radii of cross-sectional circles. Then the problem
becomes that of trying to solve for an approximation to $f_R$, since it is the distribution
of spherical radii that is of real interest. ■


Bivariate Normal Density 
The conditional density of Y given X is the ratio of the bivariate normal density to a
univariate normal density. After some messy algebra, this ratio simplifies to

$$f_{Y|X}(y|x) = \frac{1}{\sigma_Y \sqrt{2\pi (1 - \rho^2)}} \exp\left( -\frac{1}{2} \frac{\left[ y - \mu_Y - \rho \dfrac{\sigma_Y}{\sigma_X} (x - \mu_X) \right]^2}{\sigma_Y^2 (1 - \rho^2)} \right)$$

This is a normal density with mean $\mu_Y + \rho (x - \mu_X) \sigma_Y / \sigma_X$ and variance $\sigma_Y^2 (1 - \rho^2)$.
The conditional distribution of Y given X is a univariate normal distribution.

In Example B in Section 2.2.3, the distribution of the velocity of a turbulent
wind flow was shown to be approximately normally distributed. Van Atta and Chen
(1968) also measured the joint distribution of the velocity at a point at two different
times, $t$ and $t + \tau$. Figure 3.14 shows the measured conditional density of the velocity,
$v_2$, at time $t + \tau$, given various values of $v_1$. There is a systematic departure
from the normal distribution. Therefore, it appears that, even though the velocity
is normally distributed, the joint distribution of $v_1$ and $v_2$ is not bivariate normal.
This should not be totally unexpected, since the relation of $v_1$ and $v_2$ must conform
to equations of motion and continuity, which may not permit a joint normal
distribution. ■

Example C illustrates that even when two random variables are marginally normally
distributed, they need not be jointly normally distributed.

FIGURE 3.14 The conditional densities of $v_2$ given $v_1$ for selected values of $v_1$,
where $v_1$ and $v_2$ are components of the velocity of a turbulent wind flow at different
times. The solid lines are the conditional densities according to a normal fit, and the
triangles and squares are empirical values determined from 409,600 observations.


EXAMPLE D Rejection Method
The rejection method is commonly used to generate random variables from a density
function, especially when the inverse of the cdf cannot be found in closed form
and therefore the inverse cdf method, Proposition D in Section 2.3, cannot be used.
Suppose that f is a density function that is nonzero on an interval [a, b] and zero
outside the interval (a and b may be infinite). Let M(x) be a function such that
$M(x) \ge f(x)$ on [a, b], and let

$$m(x) = \frac{M(x)}{\int_a^b M(x)\,dx}$$

be a probability density function. As we will see, the idea is to choose M so that it is
easy to generate random variables from m. If [a, b] is finite, m can be chosen to be
the uniform distribution on [a, b]. The algorithm is as follows:

Step 1: Generate T with the density m.
Step 2: Generate U, uniform on [0, 1] and independent of T. If $M(T) \times U \le f(T)$,
then let X = T (accept T). Otherwise, go to Step 1 (reject T).


See Figure 3.15. From the figure, we can see that a geometrical interpretation of this 
algorithm is as follows: Throw a dart that lands uniformly in the rectangular region of 
the figure. If the dart lands below the curve f (x), record its x coordinate; otherwise, 
reject it. 


FIGURE 3.15 Illustration of the rejection method.


We must check that the density function of the random variable X thus obtained 
is in fact f: 


$$\begin{aligned}
P(x \le X \le x + dx) &= P(x \le T \le x + dx \mid \text{accept}) \\
&= \frac{P(x \le T \le x + dx \text{ and accept})}{P(\text{accept})} \\
&= \frac{P(\text{accept} \mid x \le T \le x + dx)\, P(x \le T \le x + dx)}{P(\text{accept})}
\end{aligned}$$

First consider the numerator of this expression. We have

$$P(\text{accept} \mid x \le T \le x + dx) = P(U \le f(x)/M(x)) = \frac{f(x)}{M(x)}$$

so that the numerator is

$$\frac{m(x)\,dx\,f(x)}{M(x)} = \frac{f(x)\,dx}{\int_a^b M(x)\,dx}$$

From the law of total probability, the denominator is

$$P(\text{accept}) = P(U \le f(T)/M(T)) = \int_a^b \frac{f(t)}{M(t)}\, m(t)\,dt = \frac{1}{\int_a^b M(t)\,dt}$$

where the last two steps follow from the definition of m and since f integrates to 1.
Finally, we see that the numerator over the denominator is $f(x)\,dx$. ■

In order for the rejection method to be computationally efficient, the algorithm
should lead to acceptance with high probability; otherwise, many rejection steps may
have to be looped through for each acceptance.
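
Steps 1 and 2 of the rejection method can be sketched in a few lines. The example below is illustrative only. The target density $f(x) = 6x(1 - x)$ on [0, 1] and the constant envelope $M(x) = 1.5$ are assumptions chosen for simplicity, so that m is the uniform density on [0, 1]. With this choice $P(\text{accept}) = 1/\int_0^1 M(x)\,dx = 2/3$, so about 1.5 proposals are needed per accepted value.

```python
import random


def f(x):
    """Illustrative target density: 6x(1 - x) on [0, 1], a beta density with a = b = 2."""
    return 6.0 * x * (1.0 - x)


M = 1.5  # constant envelope: the maximum of f, attained at x = 1/2


def rejection_sample(rng):
    """Return (X, number of proposals used) following Steps 1 and 2 of the text."""
    tries = 0
    while True:
        tries += 1
        t = rng.random()   # Step 1: T from m, here uniform on [0, 1]
        u = rng.random()   # Step 2: U uniform on [0, 1], independent of T
        if M * u <= f(t):
            return t, tries  # accept T
        # otherwise reject T and return to Step 1


def run(n=20_000, seed=0):
    """Draw n accepted values; return them with the average number of proposals per value."""
    rng = random.Random(seed)
    xs, total_tries = [], 0
    for _ in range(n):
        x, tries = rejection_sample(rng)
        xs.append(x)
        total_tries += tries
    return xs, total_tries / n


if __name__ == "__main__":
    xs, avg_tries = run()
    print(sum(xs) / len(xs))  # should be near 1/2, the mean of the target density
    print(avg_tries)          # should be near 1.5
```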


Bayesian Inference 

A freshly minted coin has a certain probability of coming up heads if it is spun on its
edge, but that probability is not necessarily equal to $\frac{1}{2}$. Now suppose it is spun n times
and comes up heads X times. What has been learned about the chance the coin comes
up heads? We will go through a Bayesian treatment of this problem. Let $\Theta$ denote the
probability that the coin will come up heads. We represent our knowledge about $\Theta$
before gathering any data by a probability density on [0, 1], called the prior density.
If we are totally ignorant about $\Theta$, we might represent our state of knowledge by a
uniform density on [0, 1]:

$$f_\Theta(\theta) = 1, \qquad 0 \le \theta \le 1$$


We will see how observing X changes our knowledge about $\Theta$, transforming the prior
distribution into a "posterior" distribution.

Given a value $\theta$, X follows a binomial distribution with n trials and probability
of success $\theta$:

$$f_{X|\Theta}(x|\theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}, \qquad x = 0, 1, \ldots, n$$

Now $\Theta$ is continuous and X is discrete, and they have a joint probability distribution:

$$f_{\Theta, X}(\theta, x) = f_{X|\Theta}(x|\theta)\, f_\Theta(\theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}, \qquad x = 0, 1, \ldots, n, \quad 0 \le \theta \le 1$$

This is a density function in $\theta$ and a probability mass function in x, an object of a
kind we have not seen before. We can calculate the marginal density of X by integrating
the joint over $\theta$:

$$f_X(x) = \int_0^1 \binom{n}{x} \theta^x (1 - \theta)^{n - x}\,d\theta$$


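We can calculate this formidable looking integral by a trick. First write

$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!} = \frac{\Gamma(n + 1)}{\Gamma(x + 1)\,\Gamma(n - x + 1)}$$

(If k is an integer, $\Gamma(k) = (k - 1)!$; see Problem 49 in Chapter 2.) Recall the beta
density (Section 2.2.4)

$$f(u) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)} u^{a - 1} (1 - u)^{b - 1}, \qquad 0 \le u \le 1$$

The fact that this density integrates to 1 tells us that

$$\int_0^1 u^{a - 1} (1 - u)^{b - 1}\,du = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}$$

Thus, identifying u with $\theta$, $a - 1$ with x, and $b - 1$ with $n - x$,

$$\begin{aligned}
f_X(x) &= \binom{n}{x} \int_0^1 \theta^x (1 - \theta)^{n - x}\,d\theta \\
&= \frac{\Gamma(n + 1)}{\Gamma(x + 1)\,\Gamma(n - x + 1)} \cdot \frac{\Gamma(x + 1)\,\Gamma(n - x + 1)}{\Gamma(n + 2)} \\
&= \frac{1}{n + 1}, \qquad x = 0, 1, \ldots, n
\end{aligned}$$

Thus, if our prior on $\theta$ is uniform, each outcome of X is a priori equally likely.
Our knowledge about $\Theta$ having observed X = x is quantified in the conditional
density of $\Theta$ given X = x:

$$\begin{aligned}
f_{\Theta|X}(\theta|x) &= \frac{f_{\Theta, X}(\theta, x)}{f_X(x)} \\
&= (n + 1) \binom{n}{x} \theta^x (1 - \theta)^{n - x} \\
&= (n + 1) \frac{\Gamma(n + 1)}{\Gamma(x + 1)\,\Gamma(n - x + 1)} \theta^x (1 - \theta)^{n - x} \\
&= \frac{\Gamma(n + 2)}{\Gamma(x + 1)\,\Gamma(n - x + 1)} \theta^x (1 - \theta)^{n - x}
\end{aligned}$$

The relationship $x\Gamma(x) = \Gamma(x + 1)$ has been used in the last step (see Problem 49,
Chapter 2). Bear in mind that for each fixed x, this is a function of $\theta$ (the posterior
density of $\theta$ given x), which quantifies our opinion about $\Theta$ having observed x heads
in n spins. The posterior density is a beta density with parameters $a = x + 1$,
$b = n - x + 1$.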

A one-Euro coin has the number 1 on one face and a bird on the other face. I spun
such a coin 20 times: the 1 came up 13 of the 20 times. Using the prior, $\Theta \sim U[0, 1]$,
the posterior is beta with $a = x + 1 = 14$ and $b = n - x + 1 = 8$. Figure 3.16 shows
this posterior, which represents my opinion if I was initially totally ignorant of $\theta$ and
then observed thirteen 1s in 20 spins. From the figure, it is extremely unlikely that
$\theta < 0.25$, for example. My probability, or belief, that $\theta$ is greater than $\frac{1}{2}$ is the area
under the density to the right of $\frac{1}{2}$, which can be calculated to be 0.91. I can be 91%
certain that $\theta$ is greater than $\frac{1}{2}$.
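
The posterior probability quoted above can be reproduced by integrating the beta density numerically. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the midpoint rule is used only because it needs nothing beyond the standard library; any routine for the beta cdf would serve equally well.

```python
import math


def beta_density(theta, a, b):
    """Beta density with parameters a and b, evaluated for 0 < theta < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta))


def posterior_prob_greater(c, x, n, steps=20_000):
    """P(Theta > c | X = x) under a uniform prior, by the midpoint rule on (c, 1)."""
    a, b = x + 1, n - x + 1          # posterior is beta(x + 1, n - x + 1)
    width = (1.0 - c) / steps
    total = sum(beta_density(c + (i + 0.5) * width, a, b) for i in range(steps))
    return total * width


if __name__ == "__main__":
    # 13 ones in 20 spins: the area to the right of 1/2 under the beta(14, 8) density.
    print(round(posterior_prob_greater(0.5, 13, 20), 2))
```

The computed area is about 0.905, which rounds to the 0.91 reported in the text.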

We need to distinguish between the steps of the preceding probability calculations,
which are mathematically straightforward, and the interpretation of the
results, which goes beyond the mathematics and requires a model that belief can
be expressed in terms of probability and revised using the laws of probability. See
Figure 3.16. ■

FIGURE 3.16 Beta density with parameters a = 14 and b = 8.


3.6 Functions of Jointly Distributed Random Variables

The distribution of a function of a single random variable was developed in Section 2.3.
In this section, that development is extended to several random variables, but first some
important special cases are considered.

3.6.1 Sums and Quotients


Suppose that X and Y are discrete random variables taking values on the integers
and having the joint frequency function p(x, y), and let Z = X + Y. To find the
frequency function of Z, we note that Z = z whenever X = x and $Y = z - x$, where
x is an integer. The probability that Z = z is thus the sum over all x of these joint
probabilities, or

$$p_Z(z) = \sum_{x=-\infty}^{\infty} p(x, z - x)$$

If X and Y are independent so that $p(x, y) = p_X(x) p_Y(y)$, then

$$p_Z(z) = \sum_{x=-\infty}^{\infty} p_X(x)\, p_Y(z - x)$$

This sum is called the convolution of the sequences $p_X$ and $p_Y$.

FIGURE 3.17 $X + Y \le z$ whenever (X, Y) is in the shaded region $R_z$.


The continuous case is very similar. Supposing that X and Y are continuous random
variables, we first find the cdf of Z and then differentiate to find the density. Since
$Z \le z$ whenever the point (X, Y) is in the shaded region $R_z$ shown in Figure 3.17,
we have

$$F_Z(z) = \iint_{R_z} f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} \int_{-\infty}^{z - x} f(x, y)\,dy\,dx$$

In the inner integral, we make the change of variables $y = v - x$ to obtain

$$F_Z(z) = \int_{-\infty}^{\infty} \int_{-\infty}^{z} f(x, v - x)\,dv\,dx = \int_{-\infty}^{z} \int_{-\infty}^{\infty} f(x, v - x)\,dx\,dv$$

Differentiating, we have, if $\int_{-\infty}^{\infty} f(x, z - x)\,dx$ is continuous at z,

$$f_Z(z) = \int_{-\infty}^{\infty} f(x, z - x)\,dx$$

which is the obvious analogue of the result for the discrete case.
If X and Y are independent,

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\,dx$$

This integral is called the convolution of the functions $f_X$ and $f_Y$.

EXAMPLE A Suppose that the lifetime of a component is exponentially distributed and that an
identical and independent backup component is available. The system operates as
long as one of the components is functional; therefore, the distribution of the life of
the system is that of the sum of two independent exponential random variables. Let
$T_1$ and $T_2$ be independent exponentials with parameter $\lambda$, and let $S = T_1 + T_2$.

$$f_S(s) = \int_0^s \lambda e^{-\lambda t}\, \lambda e^{-\lambda (s - t)}\,dt$$

It is important to note the limits of integration. Beyond these limits, one of the two
component densities is zero. When dealing with densities that are nonzero only on
some subset of the real line, we must always be careful. Continuing, we have

$$f_S(s) = \lambda^2 \int_0^s e^{-\lambda s}\,dt = \lambda^2 s e^{-\lambda s}$$

This is a gamma distribution with parameters 2 and $\lambda$ (compare with Example A in
Section 3.5.2). ■
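
The convolution integral of Example A can be evaluated numerically and compared with the closed form $\lambda^2 s e^{-\lambda s}$. The sketch below is illustrative: the function names `convolve_at` and `exponential_density` are hypothetical, the values $\lambda = 1.5$ and $s = 2$ are arbitrary, and the midpoint rule stands in for the exact integration carried out in the text. Note that the limits 0 and s are passed explicitly, mirroring the care about limits urged in the example.

```python
import math


def convolve_at(f_x, f_y, z, lower, upper, steps=2000):
    """Midpoint-rule approximation of the integral of f_x(x) * f_y(z - x) over [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width
        total += f_x(x) * f_y(z - x)
    return total * width


def exponential_density(lam):
    """Return the exponential density with parameter lam as a function of t."""
    return lambda t: lam * math.exp(-lam * t) if t >= 0 else 0.0


lam, s = 1.5, 2.0
f = exponential_density(lam)
numeric = convolve_at(f, f, s, 0.0, s)            # integrate only where both densities are nonzero
closed_form = lam ** 2 * s * math.exp(-lam * s)   # the gamma density derived in Example A

if __name__ == "__main__":
    print(numeric, closed_form)  # the two values should agree closely
```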


Let us next consider the quotient of two continuous random variables. The derivation
is very similar to that for the sum of such variables, given previously: We first find
the cdf and then differentiate to find the density. Suppose that X and Y are continuous
with joint density function f and that Z = Y/X. Then $F_Z(z) = P(Z \le z)$ is the
probability of the set of (x, y) such that $y/x \le z$. If $x > 0$, this is the set $y \le xz$; if
$x < 0$, it is the set $y \ge xz$. Thus,

$$F_Z(z) = \int_{-\infty}^{0} \int_{xz}^{\infty} f(x, y)\,dy\,dx + \int_{0}^{\infty} \int_{-\infty}^{xz} f(x, y)\,dy\,dx$$

To remove the dependence of the inner integrals on x, we make the change of variables
$y = xv$ in the inner integrals and obtain

$$\begin{aligned}
F_Z(z) &= \int_{-\infty}^{0} \int_{z}^{-\infty} x f(x, xv)\,dv\,dx + \int_{0}^{\infty} \int_{-\infty}^{z} x f(x, xv)\,dv\,dx \\
&= \int_{-\infty}^{0} \int_{-\infty}^{z} (-x) f(x, xv)\,dv\,dx + \int_{0}^{\infty} \int_{-\infty}^{z} x f(x, xv)\,dv\,dx \\
&= \int_{-\infty}^{z} \int_{-\infty}^{\infty} |x| f(x, xv)\,dx\,dv
\end{aligned}$$

Finally, differentiating (again under an assumption of continuity), we find

$$f_Z(z) = \int_{-\infty}^{\infty} |x| f(x, xz)\,dx$$

In particular, if X and Y are independent,

$$f_Z(z) = \int_{-\infty}^{\infty} |x| f_X(x) f_Y(xz)\,dx$$

EXAMPLE B Suppose that X and Y are independent standard normal random variables and that
Z = Y/X. We then have

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{|x|}{2\pi} e^{-x^2/2} e^{-x^2 z^2/2}\,dx$$

From the symmetry of the integrand about zero,

$$f_Z(z) = \frac{1}{\pi} \int_0^{\infty} x e^{-x^2 ((z^2 + 1)/2)}\,dx$$

To simplify this, we make the change of variables $u = x^2$ to obtain

$$f_Z(z) = \frac{1}{2\pi} \int_0^{\infty} e^{-u ((z^2 + 1)/2)}\,du$$

Next, using the fact that $\int_0^{\infty} \lambda \exp(-\lambda x)\,dx = 1$ with $\lambda = (z^2 + 1)/2$, we get

$$f_Z(z) = \frac{1}{\pi (z^2 + 1)}, \qquad -\infty < z < \infty$$

This density is called the Cauchy density. Like the standard normal density, the
Cauchy density is symmetric about zero and bell-shaped, but the tails of the Cauchy
tend to zero very slowly compared to the tails of the normal. This can be interpreted
as being because of a substantial probability that X in the quotient Y/X is near
zero. ■


Example B indicates one method of generating Cauchy random variables—we 
can generate independent standard normal random variables and form their quotient. 
The next section shows how to generate standard normals. 
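
The quotient construction can be sketched as follows. This is an illustration, not a recommended production generator: the function names are hypothetical, and the standard normals are drawn with `random.gauss` from Python's standard library. The empirical distribution is compared with the Cauchy cdf $\frac{1}{2} + \frac{1}{\pi}\tan^{-1} z$, which follows from integrating the density $1/[\pi(z^2 + 1)]$.

```python
import math
import random


def cauchy_cdf(z):
    """Cdf obtained by integrating the Cauchy density 1 / (pi (z^2 + 1))."""
    return 0.5 + math.atan(z) / math.pi


def sample_cauchy(rng):
    """Quotient Y / X of two independent standard normal random variables."""
    x = rng.gauss(0.0, 1.0)
    y = rng.gauss(0.0, 1.0)
    return y / x  # X is exactly zero with probability zero


def empirical_cdf(z, n=50_000, seed=0):
    """Fraction of simulated quotients that are at most z."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if sample_cauchy(rng) <= z) / n


if __name__ == "__main__":
    for z in (-2.0, 0.0, 1.0, 5.0):
        print(z, empirical_cdf(z), round(cauchy_cdf(z), 4))
```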


3.6.2 The General Case

The following example illustrates the concepts that are important to the general case
of functions of several random variables and is also interesting in its own right.

EXAMPLE A Suppose that X and Y are independent standard normal random variables, which
means that their joint distribution is the standard bivariate normal distribution, or

$$f_{XY}(x, y) = \frac{1}{2\pi} e^{-(x^2/2) - (y^2/2)}$$

We change to polar coordinates and then reexpress the density in this new coordinate
system ($R \ge 0$, $0 \le \Theta \le 2\pi$):

$$R = \sqrt{X^2 + Y^2}$$

$$\Theta = \begin{cases}
\tan^{-1}(Y/X), & \text{if } X > 0 \\
\tan^{-1}(Y/X) + \pi, & \text{if } X < 0 \\
\frac{\pi}{2} \operatorname{sgn}(Y), & \text{if } X = 0,\ Y \ne 0 \\
0, & \text{if } X = 0,\ Y = 0
\end{cases}$$

(The range of the inverse tangent function is taken to be $(-\frac{\pi}{2}, \frac{\pi}{2})$.) The inverse
transformation is

$$X = R \cos\Theta$$
$$Y = R \sin\Theta$$

The joint density of R and $\Theta$ is

$$f_{R\Theta}(r, \theta)\,dr\,d\theta = P(r \le R \le r + dr,\ \theta \le \Theta \le \theta + d\theta)$$

This probability is equal to the area of the shaded patch in Figure 3.18 times
$f_{XY}[x(r, \theta), y(r, \theta)]$. The area in question is clearly $r\,dr\,d\theta$, so

$$P(r \le R \le r + dr,\ \theta \le \Theta \le \theta + d\theta) = f_{XY}(r\cos\theta, r\sin\theta)\, r\,dr\,d\theta$$

and

$$f_{R\Theta}(r, \theta) = r f_{XY}(r\cos\theta, r\sin\theta)$$

Thus,

$$f_{R\Theta}(r, \theta) = \frac{r}{2\pi} e^{-(r^2\cos^2\theta)/2 - (r^2\sin^2\theta)/2} = \frac{1}{2\pi}\, r e^{-r^2/2}$$

From this, we see that the joint density factors, implying that R and $\Theta$ are independent
random variables, that $\Theta$ is uniform on $[0, 2\pi]$, and that R has the density

$$f_R(r) = r e^{-r^2/2}, \qquad r \ge 0$$

which is called the Rayleigh density.








r rtdr 





FIGURE 3.18 The area of the shaded patch is $r\,dr\,d\theta$.

An interesting relationship can be found by changing variables again, letting
$T = R^2$. Using the standard techniques for finding the density of a function of a
single random variable, we obtain

$$f_T(t) = \frac{1}{2} e^{-t/2}, \quad t \ge 0$$

This is an exponential distribution with parameter $\frac{1}{2}$. Because $R$ and $\Theta$ are independent,
so are $T$ and $\Theta$, and the joint density of the latter pair is

$$f_{T\Theta}(t, \theta) = \frac{1}{2\pi}\left(\frac{1}{2}\right) e^{-t/2}$$

We have thus arrived at a characterization of the standard bivariate normal distribution:
$\Theta$ is uniform on $[0, 2\pi]$, and $R^2$ is exponential with parameter $\frac{1}{2}$. (Also, from
Example B in Section 3.6.1, $\tan\Theta$ follows a Cauchy distribution.)

These relationships can be used to construct an algorithm for generating standard
normal random variables, which is quite useful since $\Phi$, the cdf, and $\Phi^{-1}$ cannot be
expressed in closed form. First, generate $U_1$ and $U_2$, which are independent and
uniform on [0, 1]. Then $-2\log U_1$ is exponential with parameter $\frac{1}{2}$ and $2\pi U_2$ is
uniform on $[0, 2\pi]$. It follows that

$$X = \sqrt{-2\log U_1}\,\cos(2\pi U_2)$$

and

$$Y = \sqrt{-2\log U_1}\,\sin(2\pi U_2)$$

are independent standard normal random variables. This method of generating normally
distributed random variables is sometimes called the polar method. ■
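The polar method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `polar_method_pair`, the seed, and the sample size are hypothetical choices, and the standard library generator stands in for the independent uniforms $U_1$ and $U_2$.

```python
import math
import random

def polar_method_pair(rng):
    """Return two independent standard normal draws built from two uniforms."""
    # 1 - rng.random() lies in (0, 1], so the logarithm is always finite.
    u1 = 1.0 - rng.random()
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))  # R, where R^2 is exponential(1/2)
    angle = 2.0 * math.pi * u2               # Theta, uniform on [0, 2*pi)
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
draws = [z for _ in range(50_000) for z in polar_method_pair(rng)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((z - sample_mean) ** 2 for z in draws) / len(draws)
print(sample_mean, sample_var)  # should be close to 0 and 1
```

The sample mean and variance give only a rough sanity check; they do not by themselves establish normality or independence.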


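For the general case, suppose that X and Y are jointly distributed continuous
random variables, that X and Y are mapped onto U and V by the transformation

$$u = g_1(x, y)$$
$$v = g_2(x, y)$$

and that the transformation can be inverted to obtain

$$x = h_1(u, v)$$
$$y = h_2(u, v)$$

Assume that $g_1$ and $g_2$ have continuous partial derivatives and that the Jacobian

$$J(x, y) = \det\begin{bmatrix} \dfrac{\partial g_1}{\partial x} & \dfrac{\partial g_1}{\partial y} \\[2ex] \dfrac{\partial g_2}{\partial x} & \dfrac{\partial g_2}{\partial y} \end{bmatrix} = \left(\frac{\partial g_1}{\partial x}\right)\left(\frac{\partial g_2}{\partial y}\right) - \left(\frac{\partial g_2}{\partial x}\right)\left(\frac{\partial g_1}{\partial y}\right) \ne 0$$

for all x and y. This leads directly to the following result.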
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PROPOSITION A 
Under the assumptions just stated, the joint density of U and V is

$$f_{UV}(u, v) = f_{XY}(h_1(u, v), h_2(u, v))\,\left|J^{-1}(h_1(u, v), h_2(u, v))\right|$$

for (u, v) such that $u = g_1(x, y)$ and $v = g_2(x, y)$ for some (x, y) and 0
elsewhere. ■


We will not prove Proposition A here. It follows from the formula established 
in advanced calculus for a change of variables in multiple integrals. The essential 
elements of the proof follow the discussion in Example A. 


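To illustrate the formalism, let us redo Example A. The roles of u and v are played
by r and $\theta$:

$$r = \sqrt{x^2 + y^2}$$
$$\theta = \tan^{-1}\left(\frac{y}{x}\right)$$

The inverse transformation is

$$x = r\cos\theta$$
$$y = r\sin\theta$$

After some algebra, we obtain the partial derivatives:

$$\frac{\partial r}{\partial x} = \frac{x}{\sqrt{x^2 + y^2}} \qquad \frac{\partial r}{\partial y} = \frac{y}{\sqrt{x^2 + y^2}}$$

$$\frac{\partial \theta}{\partial x} = \frac{-y}{x^2 + y^2} \qquad \frac{\partial \theta}{\partial y} = \frac{x}{x^2 + y^2}$$

The Jacobian is the determinant of the matrix of these expressions, or

$$J(x, y) = \frac{1}{\sqrt{x^2 + y^2}} = \frac{1}{r}$$

Proposition A therefore says that

$$f_{R\Theta}(r, \theta) = r f_{XY}(r\cos\theta, r\sin\theta)$$

for $r \ge 0$, $0 \le \theta \le 2\pi$, and 0 elsewhere, which is the same as the result we obtained
by a direct argument in Example A. ■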
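The value 1/r for the Jacobian can be checked numerically. The following is a minimal sketch, assuming a point with x > 0 so that the inverse tangent formula applies as written; the helper names `g` and `jacobian`, the step size, and the test point are illustrative choices only.

```python
import math

def g(x, y):
    """Forward map (x, y) -> (r, theta), valid as written for x > 0."""
    return math.hypot(x, y), math.atan(y / x)

def jacobian(x, y, h=1e-6):
    """Central-difference estimate of the determinant of the partials of g."""
    r_xp, t_xp = g(x + h, y)
    r_xm, t_xm = g(x - h, y)
    r_yp, t_yp = g(x, y + h)
    r_ym, t_ym = g(x, y - h)
    dr_dx, dt_dx = (r_xp - r_xm) / (2 * h), (t_xp - t_xm) / (2 * h)
    dr_dy, dt_dy = (r_yp - r_ym) / (2 * h), (t_yp - t_ym) / (2 * h)
    return dr_dx * dt_dy - dt_dx * dr_dy

x0, y0 = 1.5, 0.8                      # arbitrary test point with x0 > 0
numeric = jacobian(x0, y0)
exact = 1.0 / math.hypot(x0, y0)       # the closed form 1/r
print(numeric, exact)
```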


Proposition A extends readily to transformations of more than two random variables.
If $X_1, \ldots, X_n$ have the joint density function $f_{X_1 \cdots X_n}$ and

$$Y_i = g_i(X_1, \ldots, X_n), \quad i = 1, \ldots, n$$
$$X_i = h_i(Y_1, \ldots, Y_n), \quad i = 1, \ldots, n$$

and if $J(x_1, \ldots, x_n)$ is the determinant of the matrix with the ij entry $\partial g_i/\partial x_j$, then
the joint density of $Y_1, \ldots, Y_n$ is

$$f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = f_{X_1 \cdots X_n}(x_1, \ldots, x_n)\,\left|J^{-1}(x_1, \ldots, x_n)\right|$$

wherein each $x_i$ is expressed in terms of the y's; $x_i = h_i(y_1, \ldots, y_n)$.


EXAMPLE C

Suppose that $X_1$ and $X_2$ are independent standard normal random variables and that

$$Y_1 = X_1$$
$$Y_2 = X_1 + X_2$$

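We will show that the joint distribution of $Y_1$ and $Y_2$ is bivariate normal. The Jacobian
of the transformation is simply

$$J(x_1, x_2) = \det\begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} = 1$$

Since the inverse transformation is $x_1 = y_1$ and $x_2 = y_2 - y_1$, from Proposition A the
joint density of $Y_1$ and $Y_2$ is

$$f_{Y_1 Y_2}(y_1, y_2) = \frac{1}{2\pi}\exp\left[-\frac{1}{2}\left(y_1^2 + (y_2 - y_1)^2\right)\right] = \frac{1}{2\pi}\exp\left[-\frac{1}{2}\left(2y_1^2 + y_2^2 - 2y_1 y_2\right)\right]$$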


This can be recognized to be a bivariate normal density, the parameters of which can
be identified by comparing the constants in this expression with the general form of the
bivariate normal (see Example F of Section 3.3). First, since the exponential contains
only quadratic terms in $y_1$ and $y_2$, we have $\mu_{Y_1} = \mu_{Y_2} = 0$. (If $\mu_{Y_1}$ were nonzero,
for example, examination of the equation for the bivariate density in Example F of
Section 3.3 shows that there would be a term $y_1\mu_{Y_1}$.) Next, from the constant that
occurs in front of the exponential, we have

$$\sigma_{Y_1}\sigma_{Y_2}\sqrt{1 - \rho^2} = 1$$

From the coefficient of $y_1^2$ we have

$$\sigma_{Y_1}^2(1 - \rho^2) = \frac{1}{2}$$

Dividing the second relationship into the square of the first gives $\sigma_{Y_2}^2 = 2$. From the
coefficient of $y_2^2$, we have

$$\sigma_{Y_2}^2(1 - \rho^2) = 1$$

from which it follows that $\rho^2 = \frac{1}{2}$.

From the sign of the cross product, we see that $\rho = 1/\sqrt{2}$. Finally, we have
$\sigma_{Y_1}^2 = 1$. We thus see that this linear transformation of two independent standard
normal random variables follows a bivariate normal distribution. This is a special
case of a more general result: A nonsingular linear transformation of two random
variables whose joint distribution is bivariate normal yields two random variables
whose joint distribution is still bivariate normal, although with different parameters.
(See Problem 58.) ■
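A quick simulation can corroborate the parameters identified in this example. The sketch below assumes Python's `random.gauss` as the source of independent standard normals; the seed and sample size are arbitrary, and the check covers only the second moments, not the full bivariate normal shape.

```python
import math
import random

rng = random.Random(1)   # fixed seed, for reproducibility only
n = 100_000              # illustrative sample size

pairs = []
for _ in range(n):
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    pairs.append((x1, x1 + x2))          # (Y1, Y2) = (X1, X1 + X2)

mean1 = sum(a for a, _ in pairs) / n
mean2 = sum(b for _, b in pairs) / n
var1 = sum((a - mean1) ** 2 for a, _ in pairs) / n
var2 = sum((b - mean2) ** 2 for _, b in pairs) / n
cov = sum((a - mean1) * (b - mean2) for a, b in pairs) / n
rho = cov / math.sqrt(var1 * var2)

print(var1, var2, rho)   # compare with 1, 2, and 1/sqrt(2)
```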


3.7 Extrema and Order Statistics 


This section is concerned with ordering a collection of independent continuous
random variables. In particular, let us assume that $X_1, X_2, \ldots, X_n$ are independent
random variables with the common cdf F and density f. Let U denote the maximum
of the $X_i$ and V the minimum. The cdfs of U and V, and therefore their densities,
can be found by a simple trick.

First, we note that $U \le u$ if and only if $X_i \le u$ for all i. Thus,

$$F_U(u) = P(U \le u) = P(X_1 \le u)P(X_2 \le u) \cdots P(X_n \le u) = [F(u)]^n$$

Differentiating, we find the density,

$$f_U(u) = n f(u)[F(u)]^{n-1}$$

Similarly, $V > v$ if and only if $X_i > v$ for all i. Thus,

$$1 - F_V(v) = [1 - F(v)]^n$$

and

$$F_V(v) = 1 - [1 - F(v)]^n$$

The density function of V is therefore

$$f_V(v) = n f(v)[1 - F(v)]^{n-1}$$


EXAMPLE A

Suppose that n system components are connected in series, which means that the system
fails if any one of them fails, and that the lifetimes of the components, $T_1, \ldots, T_n$,
are independent random variables that are exponentially distributed with parameter $\lambda$:
$F(t) = 1 - e^{-\lambda t}$. The random variable that represents the length of time the system
operates is V, which is the minimum of the $T_i$ and by the preceding result has the density

$$f_V(v) = n\lambda e^{-\lambda v}\left(e^{-\lambda v}\right)^{n-1} = n\lambda e^{-n\lambda v}$$

We see that V is exponentially distributed with parameter $n\lambda$. ■


EXAMPLE B

Suppose that a system has components as described in Example A but connected
in parallel, which means that the system fails only when they all fail. The system's
lifetime is thus the maximum of n exponential random variables and has the
density

$$f_U(u) = n\lambda e^{-\lambda u}\left(1 - e^{-\lambda u}\right)^{n-1}$$

By expanding the last term using the binomial theorem, we see that this density is a
weighted sum of exponential terms rather than a simple exponential density. ■
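The cdfs of the series and parallel lifetimes can be compared with simulated values. This is a minimal sketch under assumed, purely illustrative settings (n = 5 components, rate 1, and two arbitrary time points); it uses the standard library's `expovariate` for the component lifetimes.

```python
import math
import random

rng = random.Random(2)            # fixed seed, for reproducibility only
n, lam, trials = 5, 1.0, 50_000   # illustrative values, not from the text
t_min, t_max = 0.3, 2.0           # time points at which the cdfs are compared

below_min = below_max = 0
for _ in range(trials):
    lifetimes = [rng.expovariate(lam) for _ in range(n)]
    below_min += min(lifetimes) <= t_min   # series system has failed by t_min
    below_max += max(lifetimes) <= t_max   # parallel system has failed by t_max

emp_min, emp_max = below_min / trials, below_max / trials
exact_min = 1 - math.exp(-n * lam * t_min)      # 1 - [1 - F(v)]^n
exact_max = (1 - math.exp(-lam * t_max)) ** n   # [F(u)]^n
print(emp_min, exact_min)
print(emp_max, exact_max)
```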


We will now derive the preceding results once more, by the differential technique,
and generalize them. To find $f_U(u)$, we observe that $u \le U \le u + du$ if one of the
n $X_i$ falls in the interval $(u, u + du)$ and the other $(n - 1)$ $X_i$ fall to the left of u.
The probability of any particular such arrangement is $[F(u)]^{n-1} f(u)\,du$, and because
there are n such arrangements,

$$f_U(u) = n[F(u)]^{n-1} f(u)$$

Now we again assume that $X_1, \ldots, X_n$ are independent continuous random variables
with density f(x). We sort the $X_i$ and denote by $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ the
order statistics. Note that $X_1$ is not necessarily equal to $X_{(1)}$. (In fact, this equality
holds with probability $n^{-1}$.) Thus, $X_{(n)}$ is the maximum, and $X_{(1)}$ is the minimum. If
n is odd, say, $n = 2m + 1$, then $X_{(m+1)}$ is called the median of the $X_i$.


THEOREM A
The density of $X_{(k)}$, the kth-order statistic, is

$$f_k(x) = \frac{n!}{(k-1)!\,(n-k)!}\, f(x)\, F^{k-1}(x)\,[1 - F(x)]^{n-k}$$

Proof

We will use a differential argument to derive this result heuristically. (The alternative
approach of first deriving the cdf and then differentiating is developed in
Problem 66 at the end of this chapter.) The event $x \le X_{(k)} \le x + dx$ occurs if
$k - 1$ observations are less than x, one observation is in the interval $[x, x + dx]$,
and $n - k$ observations are greater than $x + dx$. The probability of any particular
arrangement of this type is $f(x) F^{k-1}(x)[1 - F(x)]^{n-k}\,dx$, and, by the multinomial
theorem, there are $n!/[(k-1)!\,1!\,(n-k)!]$ such arrangements, which
completes the argument. ■


EXAMPLE C For the case where the $X_i$ are uniform on [0, 1], the density of the kth-order statistic
reduces to

$$\frac{n!}{(k-1)!\,(n-k)!}\, x^{k-1}(1 - x)^{n-k}, \quad 0 \le x \le 1$$

This is the beta density. An interesting by-product of this result is that since the
density integrates to 1,

$$\int_0^1 x^{k-1}(1 - x)^{n-k}\,dx = \frac{(k-1)!\,(n-k)!}{n!}$$
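The integral identity obtained as a by-product can be checked numerically for particular values. The sketch below uses a simple midpoint rule; the function name `beta_integral`, the number of steps, and the choice k = 3, n = 7 are illustrative assumptions.

```python
import math

def beta_integral(k, n, steps=100_000):
    """Midpoint-rule estimate of the integral of x^(k-1) (1-x)^(n-k) on [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h              # midpoint of the i-th subinterval
        total += x ** (k - 1) * (1.0 - x) ** (n - k)
    return total * h

k, n = 3, 7                            # arbitrary illustrative values
numeric = beta_integral(k, n)
exact = math.factorial(k - 1) * math.factorial(n - k) / math.factorial(n)
print(numeric, exact)
```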


Joint distributions of order statistics can also be worked out. For example, to find
the joint density of the minimum and maximum, we note that $x \le X_{(1)} \le x + dx$
and $y \le X_{(n)} \le y + dy$ if one $X_i$ falls in $[x, x + dx]$, one falls in $[y, y + dy]$, and
$n - 2$ fall in $[x, y]$. There are $n(n - 1)$ ways to choose the minimum and maximum,
and thus $V = X_{(1)}$ and $U = X_{(n)}$ have the joint density

$$f(u, v) = n(n - 1) f(v) f(u) [F(u) - F(v)]^{n-2}, \quad u \ge v$$

For example, for the uniform case,

$$f(u, v) = n(n - 1)(u - v)^{n-2}, \quad 1 \ge u \ge v \ge 0$$

The range of $X_{(1)}, \ldots, X_{(n)}$ is $R = X_{(n)} - X_{(1)}$. Using the same kind of analysis we
used in Section 3.6.1 to derive the distribution of a sum, we find

$$f_R(r) = \int_{-\infty}^{\infty} f(v + r, v)\,dv$$


EXAMPLE D

Find the distribution of the range, $U - V$, for the uniform [0, 1] case. The integrand
is $f(v + r, v) = n(n - 1)r^{n-2}$ for $0 \le v \le v + r \le 1$ or, equivalently, $0 \le v \le 1 - r$.
Thus,

$$f_R(r) = \int_0^{1-r} n(n - 1) r^{n-2}\,dv = n(n - 1) r^{n-2}(1 - r), \quad 0 \le r \le 1$$

The corresponding cdf is

$$F_R(r) = n r^{n-1} - (n - 1) r^n, \quad 0 \le r \le 1$$ ■
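The cdf of the range in the uniform case can be compared with a simulated proportion. This is a sketch only; the sample size n = 5, the evaluation point r = 0.6, the seed, and the number of trials are all illustrative assumptions.

```python
import random

rng = random.Random(3)            # fixed seed, for reproducibility only
n, trials, r = 5, 50_000, 0.6     # illustrative values, not from the text

hits = 0
for _ in range(trials):
    u = [rng.random() for _ in range(n)]
    hits += (max(u) - min(u)) <= r     # event {range <= r}

empirical = hits / trials
exact = n * r ** (n - 1) - (n - 1) * r ** n   # F_R(r) from the formula above
print(empirical, exact)
```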


Tolerance Interval 

If a large number of independent random variables having the common density function
f are observed, it seems intuitively likely that most of the probability mass of
the density f(x) is contained in the interval $(X_{(1)}, X_{(n)})$ and unlikely that a future
observation will lie outside this interval. In fact, very precise statements can be made.
For example, the amount of the probability mass in the interval is $F(X_{(n)}) - F(X_{(1)})$,
a random variable that we will denote by Q. From Proposition C of Section 2.3, the
distribution of $F(X_i)$ is uniform; therefore, the distribution of Q is the distribution
of $U_{(n)} - U_{(1)}$, which is the range of n independent uniform random variables. Thus,
$P(Q > \alpha)$, the probability that more than $100\alpha\%$ of the probability mass is contained
in the range, is, from Example D,

$$P(Q > \alpha) = 1 - n\alpha^{n-1} + (n - 1)\alpha^n$$


For example, if n = 100 and $\alpha = .95$, this probability is .96. In words, this means
that the probability is .96 that the range of 100 independent random variables covers
95% or more of the probability mass, or, with probability .96, 95% of all further
observations from the same distribution will fall between the minimum and maximum.
This statement does not depend on the actual form of the distribution. ■
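The numerical value quoted for n = 100 and $\alpha = .95$ can be reproduced directly from the formula for $P(Q > \alpha)$. The function name `coverage_probability` below is a hypothetical label for that formula.

```python
def coverage_probability(n, alpha):
    """P(Q > alpha) = 1 - n*alpha^(n-1) + (n-1)*alpha^n for a sample of size n."""
    return 1 - n * alpha ** (n - 1) + (n - 1) * alpha ** n

p = coverage_probability(100, 0.95)
print(round(p, 2))
```

Evaluating the same function for other sample sizes shows how quickly the coverage probability approaches 1 as n grows.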


3.8 Problems


1. The joint frequency function of two discrete random variables, X and Y, is given 
in the following table: 








          x
y      1     2     3     4
1    .10   .05   .02   .02
2    .05   .20   .05   .02
3    .02   .05   .20   .04
4    .02   .02   .04   .10


a. Find the marginal frequency functions of X and Y.
b. Find the conditional frequency function of X given Y = 1 and of Y given
X = 1.


2. An urn contains p black balls, q white balls, and r red balls; and n balls are 
chosen without replacement. 


a. Find the joint distribution of the numbers of black, white, and red balls in the 
sample. 

b. Find the joint distribution of the numbers of black and white balls in the 
sample. 

c. Find the marginal distribution of the number of white balls in the sample. 


3. Three players play 10 independent rounds of a game, and each player has probability
1/3 of winning each round. Find the joint distribution of the numbers of
games won by each of the three players.


4. A sieve is made of a square mesh of wires. Each wire has diameter d, and the holes 
in the mesh are squares whose side length is w. A spherical particle of radius r is 
dropped on the mesh. What is the probability that it passes through? What is the 
probability that it fails to pass through if it is dropped n times? (Calculations such 
as these are relevant to the theory of sieving for analyzing the size distribution of 
particulate matter.) 


5. (Buffon's Needle Problem) A needle of length L is dropped randomly on a plane
ruled with parallel lines that are a distance D apart, where $D \ge L$. Show that
the probability that the needle comes to rest crossing a line is $2L/(\pi D)$. Explain
how this gives a mechanical means of estimating the value of $\pi$.


6. A point is chosen randomly in the interior of an ellipse:

$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$$

Find the marginal densities of the x and y coordinates of the point.


7. Find the joint and marginal densities corresponding to the cdf

$$F(x, y) = (1 - e^{-\alpha x})(1 - e^{-\beta y}), \quad x \ge 0,\ y \ge 0,\ \alpha > 0,\ \beta > 0$$


8. Let X and Y have the joint density

$$f(x, y) = \frac{6}{7}(x + y)^2, \quad 0 \le x \le 1,\ 0 \le y \le 1$$

a. By integrating over the appropriate regions, find (i) $P(X > Y)$,
(ii) $P(X + Y \le 1)$, (iii) $P(X \le \frac{1}{2})$.

b. Find the marginal densities of X and Y. 

c. Find the two conditional densities. 


9. Suppose that (X, Y) is uniformly distributed over the region defined by
$0 \le y \le 1 - x^2$ and $-1 \le x \le 1$.
a. Find the marginal densities of X and Y. 
b. Find the two conditional densities. 


10. A point is uniformly distributed in a unit sphere in three dimensions.


a. Find the marginal densities of the x, y, and z coordinates. 
b. Find the joint density of the x and y coordinates. 
c. Find the density of the xy coordinates conditional on Z = 0. 


11. Let $U_1$, $U_2$, and $U_3$ be independent random variables uniform on [0, 1]. Find the
probability that the roots of the quadratic $U_1 x^2 + U_2 x + U_3$ are real.


12. Let

$$f(x, y) = c(x^2 - y^2)e^{-x}, \quad 0 \le x < \infty,\ -x \le y < x$$


a. Find c. 
b. Find the marginal densities. 
c. Find the conditional densities. 


13. A fair coin is thrown once; if it lands heads up, it is thrown a second time. Find
the frequency function of the total number of heads. 


14. Suppose that

$$f(x, y) = x e^{-x(y+1)}, \quad 0 \le x < \infty,\ 0 \le y < \infty$$


a. Find the marginal densities of X and Y. Are X and Y independent? 
b. Find the conditional densities of X and Y. 


15. Suppose that X and Y have the joint density function

$$f(x, y) = c\sqrt{1 - x^2 - y^2}, \quad x^2 + y^2 \le 1$$

a. Find c.

b. Sketch the joint density.

c. Find $P(X^2 + Y^2 \le \frac{1}{2})$.

d. Find the marginal densities of X and Y. Are X and Y independent random 
variables? 


e. Find the conditional densities. 


16. What is the probability density of the time between the arrival of the two packets
of Example E in Section 3.4? 


17. Let (X, Y) be a random point chosen uniformly on the region
$R = \{(x, y) : |x| + |y| \le 1\}$.

a. Sketch R. 

b. Find the marginal densities of X and Y using your sketch. Be careful of the 
range of integration. 

c. Find the conditional density of Y given X. 


18. Let X and Y have the joint density function

$$f(x, y) = k(x - y), \quad 0 \le y \le x \le 1$$

and 0 elsewhere.


a. Sketch the region over which the density is positive and use it in determining 
limits of integration to answer the following questions. 

b. Find k. 

c. Find the marginal densities of X and Y. 

d. Find the conditional densities of Y given X and X given Y. 


19. Suppose that two components have independent exponentially distributed lifetimes,
$T_1$ and $T_2$, with parameters $\alpha$ and $\beta$, respectively. Find (a) $P(T_1 > T_2)$ and
(b) $P(T_1 > 2T_2)$.


20. If $X_1$ is uniform on [0, 1], and, conditional on $X_1$, $X_2$ is uniform on $[0, X_1]$, find
the joint and marginal distributions of $X_1$ and $X_2$.


21. An instrument is used to measure very small concentrations, X, of a certain
chemical in soil samples. Suppose that the values of X in those soils in which the
chemical is present are modeled as a random variable with density function f(x).
The assay of a soil reports a concentration only if the chemical is first determined
to be present. At very low concentrations, however, the chemical may fail to
be detected even if it is present. This phenomenon is modeled by assuming that
if the concentration is x, the chemical is detected with probability R(x). Let Y
denote the concentration of a chemical in a soil in which it has been determined
to be present. Show that the density function of Y is

$$\frac{R(y) f(y)}{\int_0^\infty R(x) f(x)\,dx}$$
22. Consider a Poisson process on the real line, and denote by $N(t_1, t_2)$ the number
of events in the interval $(t_1, t_2)$. If $t_0 < t_1 < t_2$, find the conditional distribution of
$N(t_0, t_1)$ given that $N(t_0, t_2) = n$. (Hint: Use the fact that the numbers of events
in disjoint subsets are independent.)


23. Suppose that, conditional on N, X has a binomial distribution with N trials and
probability p of success, and that N is a binomial random variable with m trials
and probability r of success. Find the unconditional distribution of X.


24. Let P have a uniform distribution on [0, 1], and, conditional on P = p, let X
have a Bernoulli distribution with parameter p. Find the conditional distribution
of P given X.


25. Let X have the density function f, and let Y = X with probability $\frac{1}{2}$ and Y = −X
with probability $\frac{1}{2}$. Show that the density of Y is symmetric about zero; that is,

$$f_Y(y) = f_Y(-y)$$


26. Spherical particles whose radii have the density function $f_R(r)$ are dropped on a
mesh as in Problem 4. Find an expression for the density function of the particles 
that pass through. 


27. Prove that X and Y are independent if and only if $f_{X|Y}(x|y) = f_X(x)$ for all x
and y.


28. Show that C(u, v) = uv is a copula. Why is it called "the independence copula"?


29. Use the Farlie-Morgenstern copula to construct a bivariate density whose marginal
densities are exponential. Find an expression for the joint density.


30. For $0 \le \alpha \le 1$ and $0 \le \beta \le 1$, show that $C(u, v) = \min(u^{1-\alpha}v, uv^{1-\beta})$ is a
copula (the Marshall-Olkin copula). What is the joint density?


31. Suppose that (X, Y) is uniform on the disk of radius 1 as in Example E of Section 3.3.
Without doing any calculations, argue that X and Y are not independent.


32. Continuing Example E of Section 3.5.2, suppose you had to guess a value of $\theta$.
One plausible guess would be the value of $\theta$ that maximizes the posterior density.
Find that value. Does the result make intuitive sense?


33. Suppose that, as in Example E of Section 3.5.2, your prior opinion that the coin
will land with heads up is represented by a uniform density on [0, 1]. You now
spin the coin repeatedly and record the number of times, N, until a heads comes
up. So if heads comes up on the first spin, N = 1, etc.


a. Find the posterior density of $\Theta$ given N.
b. Do this with a newly minted penny and graph the posterior density.


34. This problem continues Example E of Section 3.5.2. In that example, the prior
opinion for the value of $\Theta$ was represented by the uniform density. Suppose that
the prior density had been a beta density with parameters a = b = 3, reflecting
a stronger prior belief that the chance of a 1 was near $\frac{1}{2}$. Graph this prior density.
Following the reasoning of the example, find the posterior density, plot it, and
compare it to the posterior density shown in the example.


35. Find a newly minted penny. Place it on its edge and spin it 20 times. Following
Example E of Section 3.5.2, calculate and graph the posterior distribution. Spin
another 20 times, and calculate and graph the posterior based on all 40 spins.
What happens as you increase the number of spins?


36. Let $f(x) = (1 + \alpha x)/2$, for $-1 \le x \le 1$ and $-1 \le \alpha \le 1$.

a. Describe an algorithm to generate random variables from this density using 
the rejection method. 

b. Write a computer program to do so, and test it out. 


37. Let $f(x) = 6x^2(1 - x)^2$, for $-1 \le x \le 1$.
a. Describe an algorithm to generate random variables from this density using
the rejection method. In what proportion of the trials will the acceptance step
be taken?
b. Write a computer program to do so, and test it out.


38. Show that the number of iterations necessary to generate a random variable using
the rejection method is a geometric random variable, and evaluate the parameter 
of the geometric frequency function. Show that in order to keep the number of 
iterations small, M (x) should be chosen to be close to f (x). 


39. Show that the following method of generating discrete random variables works
(D. R. Fredkin). Suppose, for concreteness, that X takes on values 0, 1, 2, ... with
probabilities $p_0, p_1, p_2, \ldots$. Let U be a uniform random variable. If $U < p_0$,
return X = 0. If not, replace U by $U - p_0$, and if the new U is less than $p_1$,
return X = 1. If not, decrement U by $p_1$, compare U to $p_2$, etc.


40. Suppose that X and Y are discrete random variables with a joint probability
mass function $p_{XY}(x, y)$. Show that the following procedure generates a random
variable $X \sim p_{X|Y}(x|y)$.

a. Generate $X \sim p_X(x)$.

b. Accept X with probability $p(y|X)$.

c. If X is accepted, terminate and return X. Otherwise go to Step a.


Now suppose that X is uniformly distributed on the integers 1, 2, ..., 100 and
that given X = x, Y is uniform on the integers 1, 2, ..., x. You observe Y = 44.
What does this tell you about X? Simulate the distribution of X, given Y = 44,
1000 times and make a histogram of the values obtained. How would you estimate
E(X|Y = 44)?


41. How could you extend the procedure of the previous problem in the case that X
and Y are continuous random variables? 


42. a. Let T be an exponential random variable with parameter $\lambda$; let W be a random
variable independent of T, which is ±1 with probability $\frac{1}{2}$ each; and let X =
WT. Show that the density of X is

$$f_X(x) = \frac{\lambda}{2} e^{-\lambda |x|}$$

which is called the double exponential density.
b. Show that for some constant c,


Use this result and that of part (a) to show how to use the rejection method to 
generate random variables from a standard normal density. 


43. Let $U_1$ and $U_2$ be independent and uniform on [0, 1]. Find and sketch the density
function of $S = U_1 + U_2$.


44. Let $N_1$ and $N_2$ be independent random variables following Poisson distributions
with parameters $\lambda_1$ and $\lambda_2$. Show that the distribution of $N = N_1 + N_2$ is Poisson
with parameter $\lambda_1 + \lambda_2$.


45. For a Poisson distribution, suppose that events are independently labeled A and
B with probabilities $p_A + p_B = 1$. If the parameter of the Poisson distribution is
$\lambda$, show that the number of events labeled A follows a Poisson distribution with
parameter $p_A\lambda$.


46. Let X and Y be jointly continuous random variables. Find an expression for the
density of Z = X − Y.


47. Let X and Y be independent standard normal random variables. Find the density
of Z = X + Y, and show that Z is normally distributed as well. (Hint: Use the 
technique of completing the square to help in evaluating the integral.) 


48. Let $T_1$ and $T_2$ be independent exponentials with parameters $\lambda_1$ and $\lambda_2$. Find the
density function of $T_1 + T_2$.


49. Find the density function of X + Y, where X and Y have a joint density as given
in Example D in Section 3.3. 


50. Suppose that X and Y are independent discrete random variables and each assumes
the values 0, 1, and 2 with probability 1/3 each. Find the frequency function
of X + Y.


51. Let X and Y have the joint density function f(x, y), and let Z = XY. Show that
the density function of Z is

$$f_Z(z) = \int_{-\infty}^{\infty} f\left(y, \frac{z}{y}\right) \frac{1}{|y|}\,dy$$


52. Find the density of the quotient of two independent uniform random variables.


53. Consider forming a random rectangle in two ways. Let $U_1$, $U_2$, and $U_3$ be independent
random variables uniform on [0, 1]. One rectangle has sides $U_1$ and $U_2$,
and the other is a square with sides U3. Find the probability that the area of the 
square is greater than the area of the other rectangle. 


54. Let X, Y, and Z be independent $N(0, \sigma^2)$. Let $\Theta$, $\Phi$, and R be the corresponding
random variables that are the spherical coordinates of (X, Y, Z):

$$x = r\sin\phi\cos\theta$$
$$y = r\sin\phi\sin\theta$$
$$z = r\cos\phi$$

Find the joint and marginal densities of $\Theta$, $\Phi$, and R. (Hint: $dx\,dy\,dz = r^2\sin\phi\,dr\,d\theta\,d\phi$.)


55. A point is generated on a unit disk in the following way: The radius, R, is uniform
on [0, 1], and the angle $\Theta$ is uniform on $[0, 2\pi]$ and is independent of R.


a. Find the joint density of $X = R\cos\Theta$ and $Y = R\sin\Theta$.

b. Find the marginal densities of X and Y. 

c. Is the density uniform over the disk? If not, modify the method to produce a 
uniform density. 


56. If X and Y are independent exponential random variables, find the joint density
of the polar coordinates R and $\Theta$ of the point (X, Y). Are R and $\Theta$ independent?


57. Suppose that $Y_1$ and $Y_2$ follow a bivariate normal distribution with parameters
$\mu_{Y_1} = \mu_{Y_2} = 0$, $\sigma_{Y_1}^2 = 1$, $\sigma_{Y_2}^2 = 2$, and $\rho = 1/\sqrt{2}$. Find a linear transformation
$x_1 = a_{11}y_1 + a_{12}y_2$, $x_2 = a_{21}y_1 + a_{22}y_2$ such that $x_1$ and $x_2$ are independent
standard normal random variables. (Hint: See Example C of Section 3.6.2.)





58. Show that if the joint distribution of $X_1$ and $X_2$ is bivariate normal, then the joint
distribution of $Y_1 = a_1 X_1 + b_1$ and $Y_2 = a_2 X_2 + b_2$ is bivariate normal.


59. Let $X_1$ and $X_2$ be independent standard normal random variables. Show that the
joint distribution of

$$Y_1 = a_{11}X_1 + a_{12}X_2 + b_1$$
$$Y_2 = a_{21}X_1 + a_{22}X_2 + b_2$$


is bivariate normal. 


60. Using the results of the previous problem, describe a method for generating
pseudorandom variables that have a bivariate normal distribution from independent
pseudorandom uniform variables.


61. Let X and Y be jointly continuous random variables. Find an expression for the
joint density of U = a + bX and V = c + dY.


62. If X and Y are independent standard normal random variables, find
$P(X^2 + Y^2 \le 1)$.


63. Let X and Y be jointly continuous random variables.


a. Develop an expression for the joint density of X + Y and X — Y. 

b. Develop an expression for the joint density of XY and Y/X. 

c. Specialize the expressions from parts (a) and (b) to the case where X and Y 
are independent. 


64. Find the joint density of X + Y and X/Y, where X and Y are independent
exponential random variables with parameter $\lambda$. Show that X + Y and X/Y are
independent.


65. Suppose that a system's components are connected in series and have lifetimes
that are independent exponential random variables with parameters $\lambda_i$. Show that
the lifetime of the system is exponential with parameter $\sum \lambda_i$.


66. Each component of the following system (Figure 3.19) has an independent
exponentially distributed lifetime with parameter $\lambda$. Find the cdf and the density of
the system's lifetime.


FIGURE 3.19


67. A card contains n chips and has an error-correcting mechanism such that the card
still functions if a single chip fails but does not function if two or more chips fail.
If each chip has a lifetime that is an independent exponential with parameter $\lambda$,
find the density function of the card's lifetime.


68. Suppose that a queue has n servers and that the length of time to complete a job
is an exponential random variable. If a job is at the top of the queue and will be 
handled by the next available server, what is the distribution of the waiting time 
until service? What is the distribution of the waiting time until service of the next 
job in the queue? 


69. Find the density of the minimum of n independent Weibull random variables,
each of which has the density

$$f(t) = \beta\alpha^{-\beta} t^{\beta - 1} e^{-(t/\alpha)^\beta}, \quad t \ge 0$$


70. If five numbers are chosen at random in the interval [0, 1], what is the probability
that they all lie in the middle half of the interval? 


71. Let $X_1, \ldots, X_n$ be independent random variables, each with the density function
f. Find an expression for the probability that the interval $(-\infty, X_{(n)}]$
encompasses at least $100\nu\%$ of the probability mass of f.


72. Let $X_1, X_2, \ldots, X_n$ be independent continuous random variables each with
cumulative distribution function F. Show that the joint cdf of $X_{(1)}$ and $X_{(n)}$ is

$$F(x, y) = F^n(y) - [F(y) - F(x)]^n, \quad x \le y$$

73. If $X_1, \ldots, X_n$ are independent random variables, each with the density function
f, show that the joint density of $X_{(1)}, \ldots, X_{(n)}$ is

$$n!\, f(x_1) f(x_2) \cdots f(x_n), \quad x_1 < x_2 < \cdots < x_n$$

74. Let $U_1$, $U_2$, and $U_3$ be independent uniform random variables.


a. Find the joint density of $U_{(1)}$, $U_{(2)}$, and $U_{(3)}$.

b. The locations of three gas stations are independently and randomly placed
along a mile of highway. What is the probability that no two gas stations are
less than 1/3 mile apart?


75. Use the differential method to find the joint density of $X_{(i)}$ and $X_{(j)}$, where i < j.


76. Prove Theorem A of Section 3.7 by finding the cdf of $X_{(k)}$ and differentiating.
(Hint: $X_{(k)} \le x$ if and only if k or more of the $X_i$ are less than or equal to x. The
number of $X_i$ less than or equal to x is a binomial random variable.)


77. Find the density of $U_{(k)} - U_{(k-1)}$ if the $U_i$, i = 1, ..., n are independent uniform
random variables. This is the density of the spacing between adjacent points
chosen uniformly in the interval [0, 1].


78. Show that

$$\int_0^1 \int_0^y (y - x)^n\,dx\,dy = \frac{1}{(n + 1)(n + 2)}$$


79. If $T_1$ and $T_2$ are independent exponential random variables, find the density
function of $R = T_{(2)} - T_{(1)}$.


80. Let $U_1, \ldots, U_n$ be independent uniform random variables, and let V be uniform
and independent of the $U_i$.

a. Find $P(V \le U_{(n)})$.

b. Find $P(U_{(1)} < V < U_{(n)})$.


81. Do both parts of Problem 80 again, assuming that the $U_i$ and V have the density
function f and the cdf F, with $F^{-1}$ uniquely defined. Hint: $F(U_i)$ has a uniform
distribution.
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The Expected Value of a Random Variable 


The concept of the expected value of a random variable parallels the notion of a 
weighted average. The possible values of the random variable are weighted by their 
probabilities, as specified in the following definition. 


DEFINITION 


If X is a discrete random variable with frequency function p(x), the expected
value of X, denoted by E(X), is

$$E(X) = \sum_i x_i\, p(x_i)$$

provided that $\sum_i |x_i|\, p(x_i) < \infty$. If the sum diverges, the expectation is
undefined. ■


E(X) is also referred to as the mean of X and is often denoted by $\mu$ or $\mu_X$.
It might be helpful to think of the expected value of X as the center of mass of the 
frequency function. Imagine placing the masses p(x;) at the points x; on a beam; the 
balance point of the beam is the expected value of X. 


EXAMPLE A Roulette

A roulette wheel has the numbers 1 through 36, as well as 0 and 00. If you bet $1
that an odd number comes up, you win or lose $1 according to whether that event
occurs. If X denotes your net gain, X = 1 with probability $\frac{18}{38}$ and X = −1 with
probability $\frac{20}{38}$. The expected value of X is

$$E(X) = 1 \times \frac{18}{38} + (-1) \times \frac{20}{38} = -\frac{1}{19}$$

Thus, your expected loss is about $.05. In Chapter 5, it will be shown that this coincides
in the limit with the actual average loss per game if you play a long sequence of
independent games. ■
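The roulette expectation is a two-term weighted average, which can be reproduced exactly with rational arithmetic. The dictionary name `outcomes` below is an illustrative choice.

```python
from fractions import Fraction

# Net gain -> probability for a $1 bet on odd (18 odd numbers out of 38 slots).
outcomes = {1: Fraction(18, 38), -1: Fraction(20, 38)}

expected_gain = sum(x * p for x, p in outcomes.items())
print(expected_gain, float(expected_gain))
```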


EXAMPLE B Expectation of a Geometric Random Variable
Suppose that items produced in a plant are independently defective with probability
p. Items are inspected one by one until a defective item is found. On the average, how
many items must be inspected?

The number of items inspected, X, is a geometric random variable, with
$P(X = k) = q^{k-1}p$, where $q = 1 - p$. Therefore,

$$E(X) = \sum_{k=1}^{\infty} k p q^{k-1} = p \sum_{k=1}^{\infty} k q^{k-1}$$

We use a trick to calculate the sum. Since $k q^{k-1} = \frac{d}{dq} q^k$, we interchange the
operations of summation and differentiation to obtain

$$E(X) = p \frac{d}{dq} \sum_{k=1}^{\infty} q^k = p \frac{d}{dq}\left(\frac{q}{1 - q}\right) = \frac{p}{(1 - q)^2} = \frac{1}{p}$$

It can be shown that the interchange of differentiation and summation is justified.
Thus, for example, if 10% of the items are defective, an average of 10 items must be
examined to find one that is defective, as might have been guessed. ■
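The closed form 1/p can be compared with a truncated version of the defining sum. This is a sketch: the function name `geometric_mean_partial` and the truncation point are assumptions, and the truncation is harmless here only because the neglected tail is negligibly small for p = 0.1.

```python
def geometric_mean_partial(p, terms=2000):
    """Partial sum of k * q^(k-1) * p for k = 1..terms, with q = 1 - p."""
    q = 1.0 - p
    return sum(k * q ** (k - 1) * p for k in range(1, terms + 1))

p = 0.1                               # 10% defective, as in the example
approx = geometric_mean_partial(p)
print(approx, 1.0 / p)
```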


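EXAMPLE C Poisson Distribution

The expected value of a Poisson random variable is

$$E(X) = \sum_{k=0}^{\infty} \frac{k \lambda^k}{k!} e^{-\lambda} = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!}$$

Since $\sum_{j=0}^{\infty} (\lambda^j / j!) = e^{\lambda}$, we have $E(X) = \lambda$. The parameter $\lambda$ of the Poisson
distribution can thus be interpreted as the average count. ■

EXAMPLE D St. Petersburg Paradox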

A gambler has the following strategy for playing a sequence of games: He starts off
betting $1; if he loses, he doubles his bet; and he continues to double his bet until he
finally wins. To analyze this scheme, suppose that the game is fair and that he wins
or loses the amount he bets. At trial 0, he bets $1; if he loses, he bets $2 at trial 1;
and if he has not won by the kth trial, he bets $2^k$ dollars. When he finally wins, he will be
$1 ahead, which can be checked by going through the scheme for the first few values
of k. This seems like a foolproof way to win $1. What could be wrong with it?

Let X denote the amount of money bet on the very last game (the game he wins).
Because the probability that k losses are followed by one win is $2^{-(k+1)}$,

$$P(X = 2^k) = \frac{1}{2^{k+1}}$$

and

$$E(X) = \sum_{n} n P(X = n) = \sum_{k=0}^{\infty} 2^k \frac{1}{2^{k+1}} = \infty$$

Formally, E(X) is not defined. Practically, the analysis shows a flaw in this scheme,
which is that it does not take into account the enormous amount of capital
required. ■
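The divergence can be seen numerically: every term of the sum for E(X) equals 1/2, so the partial sums grow without bound, while the capital staked doubles with each loss. The sketch below uses exact rational arithmetic; the function names are illustrative only.

```python
from fractions import Fraction


def partial_expectation(max_losses):
    """Sum of 2**k * P(X = 2**k) for k = 0..max_losses, where P(X = 2**k) = 2**-(k+1)."""
    return sum(Fraction(2**k, 2**(k + 1)) for k in range(max_losses + 1))


def capital_needed(losses):
    """Total amount staked (all losing bets plus the winning bet) after `losses` losses."""
    return sum(2**k for k in range(losses + 1))


for k in (0, 9, 19):
    print(k, partial_expectation(k), capital_needed(k))
```

Each additional possible loss adds 1/2 to the partial sum, so no finite value can serve as the expectation.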


The definition of expectation for a continuous random variable is a fairly obvious 
extension of the discrete case—summation is replaced by integration. 


DEFINITION 


If X is a continuous random variable with density f(x), then

$$E(X) = \int_{-\infty}^{\infty} x f(x)\, dx$$

provided that $\int |x| f(x)\, dx < \infty$. If the integral diverges, the expectation is
undefined. ■


Again E(X) can be regarded as the center of mass of the density. We next consider
some examples.


EXAMPLE E Gamma Density

If X follows a gamma density with parameters $\alpha$ and $\lambda$,

$$E(X) = \int_0^{\infty} \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha} e^{-\lambda x}\, dx$$

This integral is easy to evaluate once we realize that $\lambda^{\alpha+1} x^{\alpha} e^{-\lambda x} / \Gamma(\alpha + 1)$ is a gamma
density and therefore integrates to 1. We thus have

$$\int_0^{\infty} x^{\alpha} e^{-\lambda x}\, dx = \frac{\Gamma(\alpha + 1)}{\lambda^{\alpha+1}}$$

from which it follows that

$$E(X) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + 1)}{\lambda^{\alpha+1}}$$

Finally, using the relation $\Gamma(\alpha + 1) = \alpha \Gamma(\alpha)$, we find

$$E(X) = \frac{\alpha}{\lambda}$$

For the exponential density, $\alpha = 1$, so $E(X) = 1/\lambda$. This may be contrasted to
the median of the exponential density, which was found in Section 2.2.1 to be $\log 2 / \lambda$.
The mean and the median can both be interpreted as "typical" values of X, but they
measure different attributes of the probability distribution. ■


EXAMPLE F Normal Distribution

From the definition of the expectation, we have

$$E(X) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx$$

Making the change of variables $z = x - \mu$ changes this equation to

$$E(X) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} z e^{-z^2/(2\sigma^2)}\, dz + \frac{\mu}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/(2\sigma^2)}\, dz$$

The first integral is 0 since the contributions from z < 0 cancel those from z > 0,
and the second term is $\mu$ because the normal density integrates to 1. Thus,

$$E(X) = \mu$$

The parameter $\mu$ of the normal density is the expectation, or mean value. We could
have made the derivation much shorter by claiming that it was "obvious" that since
the center of symmetry of the density is $\mu$, the expectation must be $\mu$. ■
EXAMPLE G Cauchy Density

Recall that the Cauchy density is

$$f(x) = \frac{1}{\pi} \left(\frac{1}{1 + x^2}\right), \qquad -\infty < x < \infty$$

The density is symmetric about zero, so it would seem that E(X) = 0. However,

$$\int_{-\infty}^{\infty} \frac{|x|}{1 + x^2}\, dx = \infty$$

Therefore, the expectation does not exist. The reason that it fails to exist is that the
density decreases so slowly that very large values of X can occur with substantial
probability. ■


The expected value can be interpreted as a long-run average. In Chapter 5, it will
be shown that if E(X) exists and if $X_1, X_2, \ldots$ is a sequence of independent random
variables with the same distribution as X, and if $S_n = \sum_{i=1}^{n} X_i$, then, as $n \to \infty$,

$$\frac{S_n}{n} \to E(X)$$


This statement will be made more precise in Chapter 5. For now, a simple empirical 
demonstration will be sufficient. 


Using a pseudorandom number generator, a sequence $X_1, X_2, \ldots$ of independent
standard normal random variables was generated, as well as a sequence $Y_1, Y_2, \ldots$
of independent Cauchy random variables. Figure 4.1 shows the graphs of

$$G(n) = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad C(n) = \frac{1}{n} \sum_{i=1}^{n} Y_i, \qquad n = 1, 2, \ldots$$

Note how G(n) appears to be tending to a limit, whereas C(n) does not. ■

FIGURE 4.1 The average of n independent random variables as a function of n for
(a) normal random variables and (b) Cauchy random variables.
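A demonstration of this kind can be reproduced with a few lines of Python. The sketch below is not the code used for Figure 4.1; the seed, the sample size, and the inverse-CDF construction of the Cauchy variables are assumptions made for illustration.

```python
import math
import random


def running_means(draw, n, seed):
    """Return the list of averages of the first 1, 2, ..., n simulated values."""
    rng = random.Random(seed)
    total = 0.0
    means = []
    for i in range(1, n + 1):
        total += draw(rng)
        means.append(total / i)
    return means


def draw_normal(rng):
    return rng.gauss(0.0, 1.0)


def draw_cauchy(rng):
    # Inverse-CDF method: tan(pi * (U - 1/2)) is standard Cauchy when U is uniform.
    return math.tan(math.pi * (rng.random() - 0.5))


normal_means = running_means(draw_normal, 100_000, seed=1)   # plays the role of G(n)
cauchy_means = running_means(draw_cauchy, 100_000, seed=1)   # plays the role of C(n)
for n in (10, 1_000, 100_000):
    print(n, normal_means[n - 1], cauchy_means[n - 1])
```

The normal averages should settle near 0, whereas the Cauchy averages typically keep jumping whenever an extreme value occurs.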
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We conclude this section with a simple result that is of great utility in probability 
theory. 


THEOREM A Markov’s Inequality 


If X is a random variable with $P(X \ge 0) = 1$ and for which E(X) exists, then
$P(X \ge t) \le E(X)/t$.


Proof

We will prove this for the discrete case; the continuous case is entirely analogous.

$$E(X) = \sum_x x p(x) = \sum_{x < t} x p(x) + \sum_{x \ge t} x p(x)$$

All the terms in the sums are nonnegative because X takes on only nonnegative
values. Thus

$$E(X) \ge \sum_{x \ge t} x p(x) \ge \sum_{x \ge t} t p(x) = t P(X \ge t)$$

■

This result says that the probability that X is much bigger than E(X) is small.
Suppose that in the theorem, we let t = kE(X); then according to the result,
$P(X \ge kE(X)) \le 1/k$. This holds for any nonnegative random variable, regardless of its
probability distribution.


4.1.1 Expectations of Functions of Random Variables


We often need to find E[g(X)], where X is a random variable and g is a fixed function.
For example, according to the kinetic theory of gases, the magnitude of the velocity
of a gas molecule is random and its probability density is given by

$$f_X(x) = \frac{\sqrt{2/\pi}}{\sigma^3} x^2 e^{-x^2/(2\sigma^2)}, \qquad x \ge 0$$

(This is Maxwell's distribution; the parameter $\sigma$ depends on the temperature of the
gas.) From this density, we can find the average velocity, but suppose that we are
interested in finding the average kinetic energy $Y = \frac{1}{2} m X^2$, where m is the mass of
the molecule. The straightforward way to do this would seem to be the following: Let
Y = g(X); find the density or frequency function of Y, say, $f_Y$; and then compute
E(Y) from the definition. It turns out, fortunately, that the process is much simpler
than that.


THEOREM A

Suppose that Y = g(X).

a. If X is discrete with frequency function p(x), then

$$E(Y) = \sum_x g(x) p(x)$$

provided that $\sum_x |g(x)| p(x) < \infty$.

b. If X is continuous with density function f(x), then

$$E(Y) = \int_{-\infty}^{\infty} g(x) f(x)\, dx$$

provided that $\int |g(x)| f(x)\, dx < \infty$.


Proof 


We will prove this result for the discrete case. The basic argument is the same
for the continuous case, but making that proof rigorous requires some advanced
theory of integration. By definition,

$$E(Y) = \sum_i y_i p_Y(y_i)$$

Let $A_i$ denote the set of x's mapped to $y_i$ by g; that is, $x \in A_i$ if $g(x) = y_i$. Then

$$p_Y(y_i) = \sum_{x \in A_i} p(x)$$

and

$$E(Y) = \sum_i y_i \sum_{x \in A_i} p(x) = \sum_i \sum_{x \in A_i} y_i p(x) = \sum_i \sum_{x \in A_i} g(x) p(x) = \sum_x g(x) p(x)$$

This last step follows because the $A_i$ are disjoint and every x belongs to
some $A_i$. ■


It is worth pointing out explicitly that $E[g(X)] \ne g[E(X)]$ in general; that is, the average
value of the function is not equal to the function of the average value. Suppose, for
example, that X takes on values 1 and 2, each with probability 1/2, so E(X) = 3/2. Let
Y = 1/X. Then E(Y) is clearly 1 × .5 + .5 × .5 = .75, but 1/E(X) = 2/3.

Let us now apply Theorem A to find the average kinetic energy of a gas molecule.

$$E(Y) = \int_0^{\infty} \frac{1}{2} m x^2 f_X(x)\, dx = \frac{m}{2} \frac{\sqrt{2/\pi}}{\sigma^3} \int_0^{\infty} x^4 e^{-x^2/(2\sigma^2)}\, dx$$

To evaluate the integral, we make the change of variables $u = x^2/(2\sigma^2)$ to reduce
it to

$$\frac{2 m \sigma^2}{\sqrt{\pi}} \int_0^{\infty} u^{3/2} e^{-u}\, du = \frac{2 m \sigma^2}{\sqrt{\pi}} \Gamma\left(\frac{5}{2}\right)$$

Finally, using the facts $\Gamma(\frac{1}{2}) = \sqrt{\pi}$ and $\Gamma(\alpha + 1) = \alpha \Gamma(\alpha)$, we have

$$E(Y) = \frac{3}{2} m \sigma^2$$
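Both calculations can be checked numerically by applying Theorem A directly, without ever finding the distribution of Y. In the sketch below the helper names, the values m = 2.0 and sigma = 1.5, and the midpoint-rule integration truncated at 20 sigma are all illustrative assumptions.

```python
import math
from fractions import Fraction


def expect(g, pmf):
    """E[g(X)] = sum of g(x) p(x) for a discrete X given as {x: p(x)}."""
    return sum(g(x) * p for x, p in pmf.items())


pmf = {1: Fraction(1, 2), 2: Fraction(1, 2)}
mean_of_reciprocal = expect(lambda x: 1 / Fraction(x), pmf)   # E(1/X)
reciprocal_of_mean = 1 / expect(lambda x: x, pmf)              # 1/E(X)


def maxwell_density(x, sigma):
    return math.sqrt(2 / math.pi) / sigma**3 * x**2 * math.exp(-x**2 / (2 * sigma**2))


def mean_kinetic_energy(m, sigma, steps=200_000):
    """Midpoint-rule approximation of the integral of (m x^2 / 2) f(x) over [0, 20 sigma]."""
    h = 20 * sigma / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += 0.5 * m * x**2 * maxwell_density(x, sigma)
    return total * h


print(mean_of_reciprocal, reciprocal_of_mean)
print(mean_kinetic_energy(m=2.0, sigma=1.5), 1.5 * 2.0 * 1.5**2)
```

The second line printed compares the numerical integral with the closed form (3/2) m sigma^2.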


Now suppose that $Y = g(X_1, \ldots, X_n)$, where the $X_i$ have a joint distribution, and
that we want to find E(Y). We do not have to find the density or frequency function
of Y, which again could be a formidable task.


THEOREM B

Suppose that $X_1, \ldots, X_n$ are jointly distributed random variables and
$Y = g(X_1, \ldots, X_n)$.

a. If the $X_i$ are discrete with frequency function $p(x_1, \ldots, x_n)$, then

$$E(Y) = \sum_{x_1, \ldots, x_n} g(x_1, \ldots, x_n) p(x_1, \ldots, x_n)$$

provided that $\sum_{x_1, \ldots, x_n} |g(x_1, \ldots, x_n)| p(x_1, \ldots, x_n) < \infty$.

b. If the $X_i$ are continuous with joint density function $f(x_1, \ldots, x_n)$, then

$$E(Y) = \int \int \cdots \int g(x_1, \ldots, x_n) f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n$$

provided that the integral with |g| in place of g converges.


Proof

The proof is similar to that of Theorem A. ■


A stick of unit length is broken randomly in two places. What is the average length
of the middle piece?

We interpret this question to mean that the locations of the two break points are
independent uniform random variables $U_1$ and $U_2$. Therefore, we need to compute
$E|U_1 - U_2|$. Theorem B tells us that we do not need to find the density function of
$|U_1 - U_2|$ and that we can just integrate $|u_1 - u_2|$ against the joint density of $U_1$ and
$U_2$, $f(u_1, u_2) = 1$, $0 \le u_1 \le 1$, $0 \le u_2 \le 1$. Thus,

$$E|U_1 - U_2| = \int_0^1 \int_0^1 |u_1 - u_2|\, du_1\, du_2$$

$$= \int_0^1 \int_0^{u_1} (u_1 - u_2)\, du_2\, du_1 + \int_0^1 \int_{u_1}^1 (u_2 - u_1)\, du_2\, du_1$$

With some care, we find the expectation to be 1/3. This is in accord with the
intuitive argument that the smaller of $U_1$ and $U_2$ should be 1/3 on the average and the
larger should be 2/3 on the average, which means that the average difference should
be 1/3. ■
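The value 1/3 can be confirmed in two independent ways: by approximating the double integral on a grid and by simulating the break points. The following Python sketch is illustrative; the grid size, number of trials, and seed are arbitrary assumptions.

```python
import random


def mean_middle_piece_grid(steps=400):
    """Midpoint-rule approximation of the double integral of |u1 - u2| over the unit square."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        u1 = (i + 0.5) * h
        for j in range(steps):
            u2 = (j + 0.5) * h
            total += abs(u1 - u2)
    return total * h * h


def mean_middle_piece_sim(trials=200_000, seed=0):
    """Average of |U1 - U2| over simulated pairs of independent uniform break points."""
    rng = random.Random(seed)
    return sum(abs(rng.random() - rng.random()) for _ in range(trials)) / trials


print(mean_middle_piece_grid(), mean_middle_piece_sim(), 1 / 3)
```

Neither computation requires the density of $|U_1 - U_2|$, which is the point of Theorem B.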
We note the following immediate consequence of Theorem B. 


COROLLARY A 


If X and Y are independent random variables and g and h are fixed functions,
then E[g(X)h(Y)] = {E[g(X)]}{E[h(Y)]}, provided that the expectations on
the right-hand side exist. ■


In particular, if X and Y are independent, E(XY) = E(X)E(Y). The proof of 
this corollary is left to Problem 29 of the end-of-chapter problems. 


4.1.2 Expectations of Linear Combinations of Random Variables


One of the most useful properties of the expectation is that it is a linear operation.
Suppose that you were told that the average temperature on July 1 in a certain location
was 70°F, and you were asked what the average temperature in degrees Celsius was.
You can simply convert to degrees Celsius and obtain (5/9) × 70 − 17.8 = 21.1°C.
The notion of the average value of a random variable, which we have defined as the
expected value of a random variable, behaves in the same fashion. If Y = aX + b, then
E(Y) = aE(X) + b. More generally, this property extends to linear combinations of
random variables.


THEOREM A

If $X_1, \ldots, X_n$ are jointly distributed random variables with expectations $E(X_i)$
and Y is a linear function of the $X_i$, $Y = a + \sum_{i=1}^{n} b_i X_i$, then

$$E(Y) = a + \sum_{i=1}^{n} b_i E(X_i)$$


Proof 


We will prove this for the continuous case. The proof in the discrete case is parallel
and is left to Problem 24 at the end of this chapter. For notational simplicity, we
take n = 2. From Theorem B of Section 4.1.1, we have

$$E(Y) = \int\int (a + b_1 x_1 + b_2 x_2) f(x_1, x_2)\, dx_1\, dx_2$$

$$= a \int\int f(x_1, x_2)\, dx_1\, dx_2 + b_1 \int\int x_1 f(x_1, x_2)\, dx_1\, dx_2 + b_2 \int\int x_2 f(x_1, x_2)\, dx_1\, dx_2$$

The first double integral of the last expression is merely the integral of the bivariate
density, which is equal to 1. The second double integral can be evaluated as
follows:

$$\int\int x_1 f(x_1, x_2)\, dx_1\, dx_2 = \int x_1 \left[\int f(x_1, x_2)\, dx_2\right] dx_1 = \int x_1 f_{X_1}(x_1)\, dx_1 = E(X_1)$$

A similar evaluation for the third double integral brings us to

$$E(Y) = a + b_1 E(X_1) + b_2 E(X_2)$$

This proves the theorem once we check that the expectation is well defined, or
that

$$\int\int |a + b_1 x_1 + b_2 x_2| f(x_1, x_2)\, dx_1\, dx_2 < \infty$$

This can be verified using the inequality

$$|a + b_1 x_1 + b_2 x_2| \le |a| + |b_1||x_1| + |b_2||x_2|$$

and the assumption that the $E(X_i)$ exist. ■


Theorem A is extremely useful. We will illustrate its utility with several examples.

EXAMPLE A Suppose that we wish to find the expectation of a binomial random variable, Y. From
the binomial frequency function,

$$E(Y) = \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k}$$

It is not immediately obvious how to evaluate this sum. We can, however, represent
Y as the sum of Bernoulli random variables, $X_i$, which equal 1 or 0 depending on
whether there is success or failure on the ith trial,

$$Y = \sum_{i=1}^{n} X_i$$

Because $E(X_i) = 0 \times (1 - p) + 1 \times p = p$, it follows immediately that E(Y) = np.

An application of the binomial distribution and its expectation occurs in “shotgun 
sequencing” in genomics, a method of trying to figure out the sequence of letters that 
make up a long segment of DNA. It is technically too difficult to sequence the entire 
segment at once if it is very long. The basic idea of shotgun sequencing is to chop 
the DNA randomly into many small fragments, sequence each fragment, and then 
somehow assemble the fragments into one long “contig.” The hope is that if there are 
many fragments, their overlaps can be used to assemble the contig. 

Suppose, then, that the length of the DNA sequence is G and that there are N
fragments each of length L. G might be at least 100,000 and L about 500. Assume that
the left end of each fragment is equally likely to be at positions 1, 2, ..., G − L + 1.
What is the probability that a particular location x ∈ {L, L + 1, ..., G} is covered
by at least one fragment? How many fragments are expected to cover a particular
location? (The positions {1, 2, ..., L − 1} are not included in this discussion because
the boundary effect makes them a little different; for example, the only fragment
that covers position 1 has its left end at position 1.) To answer these questions, first
consider a single fragment. The chance that it covers x equals the chance that its
left end is in one of the L locations {x − L + 1, x − L + 2, ..., x}, and because the
location of the left end is uniform, this probability is

$$p = \frac{L}{G - L + 1} \approx \frac{L}{G}$$

where the approximation holds because $G \gg L$. Thus, the distribution of W, the
number of fragments that cover a particular location, is binomial with parameters N
and p.

From the binomial probability formula, the chance of coverage is

$$P(W > 0) = 1 - P(W = 0) = 1 - (1 - p)^N \approx 1 - \left(1 - \frac{L}{G}\right)^N$$

Since N is large and p is small, the distribution of W is nearly Poisson with parameter
$\lambda = Np = NL/G$. From the Poisson probability formula, $P(W = 0) \approx e^{-NL/G}$,
so the probability that a particular location is covered is approximately $1 - e^{-NL/G}$.
Observe that NL is the total length of all the fragments; the ratio NL/G is called the
coverage. Calculations of this kind are thus useful in deciding how many fragments
to use. If the coverage is 8, for example, the chance that a site is covered is .9997.

Overlap of fragments is important when trying to assemble them. Since W is a
binomial random variable, the expected number of fragments that cover a given site
is Np = NL/G, precisely the coverage.

We can also now answer this closely related question: How many sites do we
expect to be entirely missed? We will calculate this using indicator random variables:
let $I_x$ equal 1 if site x is missed and 0 otherwise. Then

$$E(I_x) = 1 \times P(I_x = 1) + 0 \times P(I_x = 0) \approx e^{-NL/G}$$

The number of sites that are not covered is

$$V = \sum_{x=L}^{G} I_x$$

and from the linearity of expectation

$$E(V) = \sum_{x=L}^{G} E(I_x) \approx G e^{-NL/G}$$

The length of the human genome is approximately $G = 3 \times 10^9$, so with eight times
coverage, we would expect about a million sites to be missed. ■
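The coverage calculations in this example are easy to tabulate. The sketch below is a hypothetical helper, not part of the text; the value N = 48,000,000 is chosen only so that NL/G = 8 for the genome length quoted above with L = 500.

```python
import math


def coverage_stats(G, L, N):
    """Return (coverage, exact P(covered), Poisson approximation, expected missed sites)."""
    p = L / (G - L + 1)                         # chance one fragment covers a given site
    exact_covered = 1 - (1 - p) ** N            # binomial: 1 - P(W = 0)
    coverage = N * L / G                        # expected number of fragments per site
    approx_covered = 1 - math.exp(-coverage)    # Poisson approximation
    expected_missed = G * math.exp(-coverage)   # linearity of expectation over sites
    return coverage, exact_covered, approx_covered, expected_missed


print(coverage_stats(G=100_000, L=500, N=1_600))
print(coverage_stats(G=3_000_000_000, L=500, N=48_000_000))
```

Comparing the second and third entries of each tuple shows how close the Poisson approximation is to the binomial value when G is much larger than L.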


EXAMPLE B Coupon Collection

Suppose that you collect coupons, that there are n distinct types of coupons, and that
on each trial you are equally likely to get a coupon of any of the types. How many trials
would you expect to go through until you had a complete set of coupons? (This might
be a model for collecting baseball cards or for certain grocery store promotions.)

The solution of this problem is greatly simplified by representing the number of
trials as a sum. Let $X_1$ be the number of trials up to and including the trial on which
the first coupon is collected: $X_1 = 1$. Let $X_2$ be the number of trials from that point up
to and including the trial on which the next coupon different from the first is obtained;
let $X_3$ be the number of trials from that point up to and including the trial on which
the third distinct coupon is collected; and so on, up to $X_n$. Then the total number of
trials, X, is the sum of the $X_i$, i = 1, 2, ..., n.

We now find the distribution of $X_r$. At this point, r − 1 of n coupons have been
collected, so on each trial the probability of success is (n − r + 1)/n. Therefore, $X_r$
is a geometric random variable, with $E(X_r) = n/(n - r + 1)$. (See Example B of
Section 4.1.) Thus,

$$E(X) = \sum_{r=1}^{n} E(X_r) = \frac{n}{n} + \frac{n}{n-1} + \cdots + \frac{n}{1} = n \sum_{r=1}^{n} \frac{1}{r}$$

For example, if there are 10 types of coupons, the expected number of trials necessary
to obtain at least one of each kind is 29.3.

Finally, we note the following famous approximation:

$$\sum_{r=1}^{n} \frac{1}{r} = \log n + \gamma + \varepsilon_n$$

where log is the natural log or $\log_e$ (unless otherwise specified, log means natural
log throughout this text), $\gamma$ is Euler's constant, $\gamma = .57\ldots$, and $\varepsilon_n$ approaches zero
as n approaches infinity. Using this approximation for n = 10, we find that the
approximate expected number of trials is 28.8. Generally, we see that the expected
number of trials grows at the rate n log n, or slightly faster than n. ■
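The two figures quoted for n = 10 can be reproduced as follows. This is a small illustrative sketch; the function names are assumptions, and the numerical value of Euler's constant is entered by hand.

```python
import math
from fractions import Fraction

EULER_GAMMA = 0.5772156649015329   # Euler's constant, to double precision


def expected_trials(n):
    """Exact expected number of trials: n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(Fraction(1, r) for r in range(1, n + 1))


def approx_trials(n):
    """Approximation n * (log n + gamma), dropping the vanishing term epsilon_n."""
    return n * (math.log(n) + EULER_GAMMA)


for n in (10, 100, 1000):
    print(n, float(expected_trials(n)), approx_trials(n))
```

The gap between the two columns is n times $\varepsilon_n$, which stays close to 1/2 as n grows.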


EXAMPLE C Group Testing

Suppose that a large number, n, of blood samples are to be screened for a relatively
rare disease. If each sample is assayed individually, n tests will be required. On the
other hand, if each sample is divided in half and one of the halves is put into a pool
with all the other halves, the pooled lot can be tested. Then, provided that the test
method is sensitive enough, if this test is negative, no further assays are necessary
and only one test has to be performed. If the test on the pooled blood is positive, each
reserved half-sample can be tested individually. In this case, a total of n + 1 tests
will be required. It is therefore plausible, assuming that the disease is rare, that some
savings can be achieved through this pooling procedure.

To analyze this more quantitatively, let us first generalize the scheme and suppose
that the n samples are first grouped into m groups of k samples each, or n = mk.
Each group is then tested; if a group tests positive, each individual in the group is
tested. If $X_i$ is the number of tests run on the ith group, the total number of tests run
is $N = \sum_{i=1}^{m} X_i$, and the expected total number of tests is

$$E(N) = \sum_{i=1}^{m} E(X_i)$$

Let us find $E(X_i)$. If the probability of a negative on any individual sample is p, then
the $X_i$ take on the value 1 with probability $p^k$ or the value k + 1 with probability
$1 - p^k$. Thus,

$$E(X_i) = p^k + (k + 1)(1 - p^k) = k + 1 - k p^k$$

We now have

$$E(N) = m(k + 1) - m k p^k = n\left(1 + \frac{1}{k} - p^k\right)$$

Recalling that n tests are necessary with no pooling, we see that the factor
$(1 + 1/k - p^k)$ is the average number of tests used in group testing as a proportion of n.
Figure 4.2 shows this proportion as a function of k for p = .99. From the figure, we
see that for group testing with a group size of about 10, only 20% of the number of
tests used with the straightforward method are needed on the average. ■

FIGURE 4.2 The proportion of n in the average number of samples tested using
group testing as a function of k.


EXAMPLE D Counting Word Occurrences in DNA Sequences

Here we consider another example from genomics, and one that again illustrates
the power of using indicator random variables. In searching for patterns in DNA
sequences, there might be reason to expect that a “word” such as TATA would occur
more frequently than expected in a random sequence. Or suppose we want to identify
regions of a DNA sequence in which the occurrence of the word is unusually large.
To quantify these ideas, we need to specify the meaning of random. In this example,
we will take it to mean that the sequence is randomly composed of the letters A, C, G,
and T in the sense that the letters at sites are independent and, at every site, each letter
has probability 1/4.

We also need to be careful to specify how we count. Consider the following
sequence:

ACTATATAGATATA

We will count overlaps, so in the preceding sequence, TATA occurs three times. Now
suppose that the sequence is of length N and that the word is of length q. Let $I_n$
be an indicator random variable taking on the value 1 if the word begins at position
n and 0 otherwise: $P(I_n = 1) = (1/4)^q$ from the assumption of independence, and
$E(I_n) = P(I_n = 1)$. Now the total number of times the word occurs is

$$W = \sum_{n=1}^{N-q+1} I_n$$

and

$$E(W) = \sum_{n=1}^{N-q+1} E(I_n) = (N - q + 1) \left(\frac{1}{4}\right)^q$$

Note that the $I_n$ are not independent; for example, in the case of the word TATA,
if $I_1 = 1$, then $I_2 = 0$. Thus W is not a binomial random variable. But despite the
lack of independence, we can find E(W) by expressing W as a linear combination of
indicator variables. ■
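The counting convention and the formula for E(W) can be made concrete with a few lines of Python. This is an illustrative sketch; the function names are not from the text.

```python
from fractions import Fraction


def count_overlapping(sequence, word):
    """Number of positions at which `word` begins in `sequence`, overlaps included."""
    q = len(word)
    return sum(1 for n in range(len(sequence) - q + 1) if sequence[n:n + q] == word)


def expected_count(N, q):
    """E(W) = (N - q + 1) * (1/4)**q for a random sequence of length N."""
    return (N - q + 1) * Fraction(1, 4) ** q


print(count_overlapping("ACTATATAGATATA", "TATA"))
print(expected_count(14, 4), float(expected_count(14, 4)))
```

Each term of the sum in `count_overlapping` is one indicator $I_n$, so the function is a direct transcription of $W = \sum I_n$.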


EXAMPLE E Investment Portfolios

An investor plans to apportion an amount of capital, $C_0$, between two investments,
placing a fraction $\pi$, $0 \le \pi \le 1$, in one investment and a fraction $1 - \pi$ in the
other for a fixed period of time. Denoting the returns (final value divided by initial
value) on the investments by $R_1$ and $R_2$, her capital at the end of the period will be
$C_1 = \pi C_0 R_1 + (1 - \pi) C_0 R_2$. Her return will then be

$$R = \frac{C_1}{C_0} = \pi R_1 + (1 - \pi) R_2$$

Suppose that the returns are unknown, as would be the case if they were stocks, for
example, and that they are hence modeled as random variables, with expected values
$E(R_1)$ and $E(R_2)$. Then her expected return is

$$E(R) = \pi E(R_1) + (1 - \pi) E(R_2)$$

How should she choose $\pi$? A simple solution would apparently be to choose $\pi = 1$
if $E(R_1) > E(R_2)$ and $\pi = 0$ otherwise. But there is more to the story, as we will see
later. ■


4.2 Variance and Standard Deviation


The expected value of a random variable is its average value and can be viewed as
an indication of the central value of the density or frequency function. The expected
value is therefore sometimes referred to as a location parameter. The median of
a distribution is also a location parameter, one that does not necessarily equal the
mean. This section introduces another parameter, the standard deviation of a random
variable, which is an indication of how dispersed the probability distribution is about
its center, of how spread out on the average are the values of the random variable about
its expectation. We first define the variance of a random variable and then define the
standard deviation in terms of the variance.


DEFINITION

If X is a random variable with expected value E(X), the variance of X is

$$\text{Var}(X) = E\{[X - E(X)]^2\}$$

provided that the expectation exists. The standard deviation of X is the square
root of the variance. ■


If X is a discrete random variable with frequency function p(x) and expected
value $\mu = E(X)$, then according to the definition and Theorem A of Section 4.1.1,

$$\text{Var}(X) = \sum_i (x_i - \mu)^2 p(x_i)$$

whereas if X is a continuous random variable with density function f(x) and
$E(X) = \mu$,

$$\text{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx$$

The variance is often denoted by $\sigma^2$ and the standard deviation by $\sigma$. From
the preceding definition, the variance of X is the average value of the squared
deviation of X from its mean. If X has units of meters, for example, the variance
has units of meters squared, and the standard deviation has units of meters.
Although we are often interested ultimately in the standard deviation rather than
the variance, it is usually easier to find the variance first and then take the square
root.

The variance of a random variable changes in a simple way under linear trans- 
formations. 


THEOREM A 
If Var(X) exists and Y = a + bX, then $\text{Var}(Y) = b^2 \text{Var}(X)$.


Proof

Since E(Y) = a + bE(X),

$$E\{[Y - E(Y)]^2\} = E\{[a + bX - a - bE(X)]^2\} = E\{b^2 [X - E(X)]^2\} = b^2 E\{[X - E(X)]^2\} = b^2 \text{Var}(X)$$

■

This result seems reasonable once you realize that the addition of a constant does
not affect the variance, since the variance is a measure of the spread around a center
and the center has merely been shifted.

The standard deviation transforms in a natural way: $\sigma_Y = |b| \sigma_X$. Thus, if the units
of measurement are changed from meters to centimeters, for example, the standard
deviation is simply multiplied by 100.


EXAMPLE A Bernoulli Distribution

If X has a Bernoulli distribution (that is, X takes on values 0 and 1 with probability
1 − p and p, respectively), then we have seen (Example A of Section 4.1.2) that
E(X) = p. By the definition of variance,

$$\text{Var}(X) = (0 - p)^2 \times (1 - p) + (1 - p)^2 \times p = p^2 - p^3 + p - 2p^2 + p^3 = p(1 - p)$$

Note that the expression p(1 − p) is a quadratic with a maximum at p = 1/2. If p
is 0 or 1, the variance is 0, which makes sense since the probability distribution is
concentrated at a single point and the random variable is not variable at all. The
distribution is most dispersed when p = 1/2. ■


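EXAMPLE B Normal Distribution

We have seen that $E(X) = \mu$. Then

$$\text{Var}(X) = E[(X - \mu)^2] = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} (x - \mu)^2 \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx$$

Making the change of variables $z = (x - \mu)/\sigma$ changes the right-hand side to

$$\frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^2 e^{-z^2/2}\, dz$$

Finally, making the change of variables $u = z^2/2$ reduces the integral to a gamma
function, and we find that $\text{Var}(X) = \sigma^2$. ■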


The following theorem gives an alternative way of calculating the variance. 


THEOREM B 


The variance of X, if it exists, may also be calculated as follows:

$$\text{Var}(X) = E(X^2) - [E(X)]^2$$


Proof

Denote E(X) by $\mu$.

$$\text{Var}(X) = E[(X - \mu)^2] = E(X^2 - 2\mu X + \mu^2)$$

By the linearity of the expectation, this becomes

$$\text{Var}(X) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2$$

as was to be shown. ■


According to Theorem B, the variance of X can be found in two steps: First find
E(X), and then find $E(X^2)$.


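EXAMPLE C Uniform Distribution

Let us apply Theorem B to find the variance of a random variable that is uniform on
[0, 1]. We know that E(X) = 1/2; next we need to find $E(X^2)$:

$$E(X^2) = \int_0^1 x^2\, dx = \frac{1}{3}$$

We thus have

$$\text{Var}(X) = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{1}{12}$$

■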
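Theorems A and B of this section can also be checked exactly for a discrete distribution. The sketch below uses a fair six-sided die, which is an assumption chosen for illustration rather than an example from the text, and exact rational arithmetic.

```python
from fractions import Fraction


def mean(pmf):
    return sum(x * p for x, p in pmf.items())


def variance_by_definition(pmf):
    """Var(X) = E[(X - mu)^2], computed from the frequency function."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def variance_shortcut(pmf):
    """Var(X) = E(X^2) - [E(X)]^2, as in Theorem B."""
    second_moment = sum(x ** 2 * p for x, p in pmf.items())
    return second_moment - mean(pmf) ** 2


def linear_transform(pmf, a, b):
    """Frequency function of Y = a + bX (b nonzero, so distinct values stay distinct)."""
    return {a + b * x: p for x, p in pmf.items()}


die = {x: Fraction(1, 6) for x in range(1, 7)}
print(variance_by_definition(die), variance_shortcut(die))
print(variance_by_definition(linear_transform(die, 3, -2)), 4 * variance_shortcut(die))
```

The first line shows the two formulas agreeing; the second shows that Y = 3 − 2X has variance $b^2 = 4$ times that of X, regardless of the shift.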


It was stated earlier that the variance or standard deviation of a random variable 
gives an indication as to how spread out its possible values are. A famous inequality, 
Chebyshev's inequality, lends a quantitative aspect to this indication. 


THEOREM C Chebyshev’s Inequality 


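Let X be a random variable with mean $\mu$ and variance $\sigma^2$. Then, for any t > 0,

$$P(|X - \mu| > t) \le \frac{\sigma^2}{t^2}$$


Proof

Let $Y = (X - \mu)^2$. Then $E(Y) = \sigma^2$, and the result follows by applying Markov's
inequality to Y: $P(|X - \mu| \ge t) = P(Y \ge t^2) \le E(Y)/t^2$. ■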




Theorem C says that if $\sigma^2$ is very small, there is a high probability that X will
not deviate much from $\mu$. For another interpretation, we can set $t = k\sigma$ so that the
inequality becomes

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

For example, the probability that X is more than $4\sigma$ away from $\mu$ is less than or
equal to 1/16. These results hold for any random variable with any distribution provided
the variance exists. In particular cases, the bounds are often much narrower. For
example, if X is normally distributed, we find from tables of the normal distribution
that $P(|X - \mu| > 2\sigma) = .05$ (compared to 1/4 obtained from Chebyshev's inequality).
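To see how conservative the bound can be, the exact normal tail probability may be compared with $1/k^2$ for several values of k. The sketch below computes the normal probability from the complementary error function in Python's standard library; the function names are illustrative.

```python
import math


def normal_two_sided_tail(k):
    """P(|X - mu| > k sigma) for a normally distributed X."""
    return math.erfc(k / math.sqrt(2))


def chebyshev_bound(k):
    """Upper bound 1/k**2 from Chebyshev's inequality, valid for any distribution."""
    return 1 / k ** 2


for k in (1, 2, 3, 4):
    print(k, round(normal_two_sided_tail(k), 5), chebyshev_bound(k))
```

For k = 2 the exact normal value is about .0455, which tables round to .05, against the bound of .25.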
Chebyshev's inequality has the following consequence. 


COROLLARY A 
If Var(X) = 0, then $P(X = \mu) = 1$.


Proof


We will give a proof by contradiction. Suppose that $P(X = \mu) < 1$. Then, for
some $\varepsilon > 0$, $P(|X - \mu| \ge \varepsilon) > 0$. However, by Chebyshev's inequality, for any
$\varepsilon > 0$,

$$P(|X - \mu| \ge \varepsilon) = 0$$

■


EXAMPLE D Investment Portfolios

We continue Example E in Section 4.1.2. Suppose that one of the two investments
is risky and the other is risk free. The first might be a stock and the other an insured
savings account. The stock has a return $R_1$, which is modeled as a random variable with
expectation $\mu_1 = 0.10$ and standard deviation $\sigma_1 = 0.075$. The standard deviation is
a measure of risk: a large standard deviation means that the returns fluctuate a lot
so that the investor might be lucky and get a large return, but might also be unlucky
and lose a lot. Suppose that the savings account has a certain return $R_2 = 0.03$. The
expected value of this return is $\mu_2 = 0.03$ and its standard deviation is 0; it is risk
free. If the investor places a fraction $\pi_1$ in the stock and a fraction $\pi_2 = 1 - \pi_1$ in the
savings account, her return is

$$R = \pi_1 R_1 + (1 - \pi_1) R_2$$

and her expected return is

$$E(R) = \pi_1 \mu_1 + (1 - \pi_1) \mu_2$$

Since $\mu_1 > \mu_2$, her expected return is maximized by $\pi_1 = 1$, putting all her money
in the stock. However, this point of view is too narrow; it does not take into account
the risk of the stock. By Theorem A,

$$\text{Var}(R) = \pi_1^2 \sigma_1^2$$

and the standard deviation of the return is $\sigma_R = \pi_1 \sigma_1$. The larger $\pi_1$, the larger
the expected return, but also the larger the risk. In choosing $\pi_1$, the investor has to
balance the risk she is willing to take against the expected gain; the desired balance
will be different for different investors. If she is risk averse, she will choose a small
value of $\pi_1$, being leery of volatile investments. By tracing out the expected return
and the standard deviation as functions of $\pi_1$, she can strike a balance with which she
is comfortable. ■
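The trade-off can be traced out numerically from the two formulas above. The sketch below uses the values given in the example; the function name and the grid of fractions are illustrative assumptions.

```python
MU_1, SIGMA_1 = 0.10, 0.075   # stock: expected return and standard deviation
MU_2 = 0.03                   # savings account: certain return


def portfolio(pi_1):
    """Expected return and standard deviation when a fraction pi_1 is in the stock."""
    expected_return = pi_1 * MU_1 + (1 - pi_1) * MU_2
    standard_deviation = abs(pi_1) * SIGMA_1
    return expected_return, standard_deviation


for pi_1 in (0.0, 0.25, 0.5, 0.75, 1.0):
    expected_return, standard_deviation = portfolio(pi_1)
    print(pi_1, round(expected_return, 4), round(standard_deviation, 4))
```

Both columns increase linearly in $\pi_1$, so each extra unit of expected return is bought with a proportional increase in risk.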


4.2.1 A Model for Measurement Error


Values of physical constants are not precisely known but must be determined by
experimental procedures. Such seemingly simple operations as weighing an object,
determining a voltage, or measuring an interval of time are actually quite complicated 
when all the details and possible sources of error are taken into account. The National 
Institute of Standards and Technology (NIST) in the United States and similar agen- 
cies in other countries are charged with developing and maintaining measurement 
standards. Such agencies employ probabilists and statisticians as well as physical 
scientists in this endeavor. 

A distinction is usually made between random and systematic measurement 
errors. A sequence of repeated independent measurements made with no deliberate 
change in the apparatus or experimental procedure may not yield identical values, 
and the uncontrollable fluctuations are often modeled as random. At the same time, 
there may be errors that have the same effect on every measurement; equipment may 
be slightly out of calibration, for example, or there may be errors associated with the 
theory underlying the method of measurement. If the true value of the quantity being
measured is denoted by $x_0$, the measurement, X, is modeled as

$$X = x_0 + \beta + \varepsilon$$

where $\beta$ is the constant, or systematic, error and $\varepsilon$ is the random component of the
error; $\varepsilon$ is a random variable with $E(\varepsilon) = 0$ and $\text{Var}(\varepsilon) = \sigma^2$. We then have

$$E(X) = x_0 + \beta$$

and

$$\text{Var}(X) = \sigma^2$$

$\beta$ is often called the bias of the measurement procedure. The two factors affecting the
size of the error are the bias and the size of the variance, $\sigma^2$. A perfect measurement
would have $\beta = 0$ and $\sigma^2 = 0$.


Measurement of the Gravity Constant 

This and the next example are taken from an interesting and readable paper by Youden
(1972), a statistician at NIST. Measurement of the acceleration due to gravity at Ottawa
was done 32 times with each of two different methods (Preston-Thomas et al. 1960).
The results are displayed as histograms in Figure 4.3. There is clearly some systematic
difference between the two methods as well as some variation within each method. It
appears that the two biases are unequal. The results from Rule 1 are more scattered
than those of Rule 2, and their standard deviation is larger. ■

FIGURE 4.3 Histograms of two sets of measurements of the acceleration due to
gravity.


An overall measure of the size of the measurement error that is often used is the 
mean squared error, which is defined as 

$$\mathrm{MSE} = E[(X - x_0)^2]$$

The mean squared error, which is the expected squared deviation of $X$ from $x_0$, can 
be decomposed into contributions from the bias and the variance. 

THEOREM A 
$\mathrm{MSE} = \beta^2 + \sigma^2$. 

Proof 
From Theorem B of Section 4.2, 

$$\begin{aligned}
E[(X - x_0)^2] &= \mathrm{Var}(X - x_0) + [E(X - x_0)]^2 \\
&= \mathrm{Var}(X) + \beta^2 \\
&= \sigma^2 + \beta^2
\end{aligned}$$
■

Measurements are often reported in the form 102 ± 1.6, for example. Although it 
is not always clear what precisely is meant by such notation, 102 is the experimentally 
determined value and 1.6 is some measure of the error. It is often claimed or hoped that 
$\beta$ is negligible relative to $\sigma$, and in that case 1.6 represents $\sigma$ or some multiple of $\sigma$. 
In the graphical presentation of experimentally obtained data, error bars, usually of 
width $\sigma$ or some multiple of $\sigma$, are placed around measured values. In some cases, 
efforts are made to bound the magnitude of $\beta$, and the bound is incorporated into the 
error bars in some fashion. 
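The decomposition of the mean squared error into squared bias plus variance can be checked by simulation. The following is only a sketch: the true value, the bias, and the noise level are made-up numbers, and normally distributed noise is an assumption made for convenience rather than part of the measurement model.

```python
import random

def simulate_measurements(x0, beta, sigma, n, seed=0):
    """Draw n measurements X = x0 + beta + eps, with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [x0 + beta + rng.gauss(0.0, sigma) for _ in range(n)]

def empirical_mse(xs, x0):
    """Average squared deviation of the measurements from the true value."""
    return sum((x - x0) ** 2 for x in xs) / len(xs)

x0, beta, sigma = 10.0, 0.3, 0.5      # hypothetical true value, bias, noise sd
xs = simulate_measurements(x0, beta, sigma, n=50_000)

mean_hat = sum(xs) / len(xs)
bias_hat = mean_hat - x0                                   # estimated bias
var_hat = sum((x - mean_hat) ** 2 for x in xs) / len(xs)   # estimated variance
mse_hat = empirical_mse(xs, x0)

# Empirical MSE, its bias/variance split, and the theoretical value.
print(mse_hat, bias_hat ** 2 + var_hat, beta ** 2 + sigma ** 2)
```

The first two printed numbers agree up to rounding because the same algebraic identity holds for sample averages; the third is the theoretical value that both approximate.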


Measurement of the Speed of Light 

Figure 4.4, taken from McNish (1962) and discussed by Youden (1972), shows 24 
independent determinations of the speed of light, c, with error bars. The right col- 
umn of the figure contains codes for the experimental methods used to obtain the 
measurements; for example, G denotes a method called the geodimeter method. 
The range of values for c is about 3.5 km/sec, and many of the errors are less than 
.5 km/sec. Examination of the figure makes it clear that the error bars are too small and 
that the spread of values cannot be accounted for by different experimental techniques 
alone—the geodimeter method produced both the smallest and the next to largest value 
for c. Youden remarks, “Surely the evidence suggests that individual investigators are 
unable to set realistic limits of error to their reported values.” He goes on to suggest 
that the differences are largely a result of calibration errors for equipment. ■

FIGURE 4.4 A plot of 24 independent determinations of the speed of light with the 
reported error bars. The investigator or country is listed in the left column, and the 
experimental method is coded in the right column. 

Covariance and Correlation 


The variance of a random variable is a measure of its variability, and the covariance 
of two random variables is a measure of their joint variability, or their degree of asso- 
ciation. After defining covariance, we will develop some of its properties and discuss 
a measure of association called correlation, which is defined in terms of covariance. 
You may find this material somewhat formal and abstract at first, but as you use them, 
covariance, correlation, and their properties will begin to seem natural and familiar. 


DEFINITION 


If $X$ and $Y$ are jointly distributed random variables with expectations $\mu_X$ and $\mu_Y$, 
respectively, the covariance of $X$ and $Y$ is 

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

provided that the expectation exists. ■


The covariance is the average value of the product of the deviation of X from 
its mean and the deviation of Y from its mean. If the random variables are positively 
associated—that is, when X is larger than its mean, Y tends to be larger than its mean 
as well—the covariance will be positive. If the association is negative—that is, when 
X is larger than its mean, Y tends to be smaller than its mean—the covariance is 
negative. These statements will be expanded in the discussion of correlation. 

By expanding the product and using the linearity of the expectation, we obtain 
an alternative expression for the covariance, paralleling Theorem B of Section 4.2: 


$$\begin{aligned}
\mathrm{Cov}(X, Y) &= E(XY - X\mu_Y - Y\mu_X + \mu_X\mu_Y) \\
&= E(XY) - E(X)\mu_Y - E(Y)\mu_X + \mu_X\mu_Y \\
&= E(XY) - E(X)E(Y)
\end{aligned}$$

In particular, if $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y)$ and $\mathrm{Cov}(X, Y) = 0$ 
(but the converse is not true). See Problems 59 and 60 at the end of this chapter for 
examples. 
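A small exact calculation illustrates why zero covariance does not imply independence. The distribution below is a hypothetical illustration chosen for this sketch (it is not one of the end-of-chapter problems): $X$ is uniform on {-1, 0, 1} and $Y = X^2$, so $Y$ is completely determined by $X$.

```python
from fractions import Fraction

# Hypothetical joint frequency function: X uniform on {-1, 0, 1}, Y = X**2.
pmf = {(-1, 1): Fraction(1, 3), (0, 0): Fraction(1, 3), (1, 1): Fraction(1, 3)}

def expect(g):
    """Expectation of g(x, y) under the finite joint frequency function."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - e_x * e_y   # E(XY) - E(X)E(Y)

# Independence would require P(X = 0, Y = 0) = P(X = 0) * P(Y = 0).
p_x0 = sum(p for (x, _), p in pmf.items() if x == 0)
p_y0 = sum(p for (_, y), p in pmf.items() if y == 0)
joint_00 = pmf.get((0, 0), Fraction(0))

print("Cov(X, Y) =", cov)
print("P(X=0, Y=0) =", joint_00, " P(X=0)P(Y=0) =", p_x0 * p_y0)
```

The covariance vanishes because the positive and negative values of $X$ cancel in $E(XY)$, yet the joint probability at (0, 0) does not factor into the product of the marginals.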


EXAMPLE A 
Let us return to the bivariate uniform distributions of Example C in Section 3.3. First, 
note that since the marginal distributions are uniform, $E(X) = E(Y) = \frac{1}{2}$. For the 
case $\alpha = -1$, the joint density of $X$ and $Y$ is $f(x, y) = 2x + 2y - 4xy$, $0 \le x \le 1$, 
$0 \le y \le 1$. 

$$\begin{aligned}
E(XY) &= \int\!\!\int xy\, f(x, y)\, dx\, dy \\
&= \int_0^1 \int_0^1 xy(2x + 2y - 4xy)\, dx\, dy \\
&= \frac{2}{9}
\end{aligned}$$

Thus, 

$$\mathrm{Cov}(X, Y) = \frac{2}{9} - \left(\frac{1}{2}\right)\left(\frac{1}{2}\right) = -\frac{1}{36}$$

The covariance is negative, indicating a negative relationship between $X$ and $Y$. In 
fact, from Figure 3.5, we see that if $X$ is less than its mean, $\frac{1}{2}$, then $Y$ tends to be 
larger than its mean, and vice versa. A similar analysis shows that when $\alpha = 1$, 
$\mathrm{Cov}(X, Y) = \frac{1}{36}$. ■
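The two covariances in this example can be reproduced with exact rational arithmetic. The sketch below assumes that the bivariate uniform family has density $1 + \alpha(1 - 2x)(1 - 2y)$ on the unit square, a form that reduces to $2x + 2y - 4xy$ when $\alpha = -1$; the helper names are hypothetical.

```python
from fractions import Fraction

def density_coeffs(alpha):
    """Coefficients {(i, j): c} of x**i * y**j in 1 + alpha*(1 - 2x)*(1 - 2y)."""
    a = Fraction(alpha)
    return {(0, 0): 1 + a, (1, 0): -2 * a, (0, 1): -2 * a, (1, 1): 4 * a}

def moment(coeffs, r, s):
    """E(X**r * Y**s), integrating the polynomial term by term.

    The integral of x**m * y**n over the unit square is 1 / ((m + 1)(n + 1)).
    """
    return sum(c * Fraction(1, (i + r + 1) * (j + s + 1))
               for (i, j), c in coeffs.items())

def covariance(alpha):
    """Cov(X, Y) = E(XY) - E(X)E(Y) for the assumed density."""
    f = density_coeffs(alpha)
    return moment(f, 1, 1) - moment(f, 1, 0) * moment(f, 0, 1)

print(covariance(-1), covariance(1))
```

Working with `Fraction` avoids floating-point rounding, so the results can be compared directly with the fractions in the text.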


We will now develop an expression for the covariance of linear combinations of 
random variables, proceeding in a number of small steps. First, since E(a + X) = 
a+ E(X), 

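$$\begin{aligned}
\mathrm{Cov}(a + X, Y) &= E\{[a + X - E(a + X)][Y - E(Y)]\} \\
&= E\{[X - E(X)][Y - E(Y)]\} \\
&= \mathrm{Cov}(X, Y)
\end{aligned}$$

Next, since $E(aX) = aE(X)$, 

$$\begin{aligned}
\mathrm{Cov}(aX, bY) &= E\{[aX - aE(X)][bY - bE(Y)]\} \\
&= E\{ab[X - E(X)][Y - E(Y)]\} \\
&= ab\,E\{[X - E(X)][Y - E(Y)]\} \\
&= ab\,\mathrm{Cov}(X, Y)
\end{aligned}$$

Next, we consider $\mathrm{Cov}(X, Y + Z)$: 

$$\begin{aligned}
\mathrm{Cov}(X, Y + Z) &= E\left([X - E(X)]\{[Y - E(Y)] + [Z - E(Z)]\}\right) \\
&= E\{[X - E(X)][Y - E(Y)] + [X - E(X)][Z - E(Z)]\} \\
&= E\{[X - E(X)][Y - E(Y)]\} + E\{[X - E(X)][Z - E(Z)]\} \\
&= \mathrm{Cov}(X, Y) + \mathrm{Cov}(X, Z)
\end{aligned}$$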


We can now put these results together to find Cov(aW + bX, cY + dZ): 


Cov(aW + bX, cY +dZ) = Cov(aW + bX, cY) + Cov(aW + bX, dZ) 
= Cov(aW, cY) + Cov(bX, cY) + Cov(aW, dZ) 
+ Cov(bX, dZ) 
= ac Cov(W, Y) + bc Cov(X, Y) + ad Cov(W, Z) 
+ bd Cov(X, Z) 


In general, the same kind of argument gives the following important bilinear 
property of covariance. 

THEOREM A 
Suppose that $U = a + \sum_{i=1}^{n} b_i X_i$ and $V = c + \sum_{j=1}^{m} d_j Y_j$. Then 

$$\mathrm{Cov}(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{m} b_i d_j\, \mathrm{Cov}(X_i, Y_j)$$
■


This theorem has many applications. In particular, since Var(X) = Cov(X, X), 
Var(X + Y) = Cov(X + Y, X + Y) 
= Var(X) + Var(Y) + 2Cov(X, Y) 


More generally, we have the following result for the variance of a linear combination 
of random variables. 


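COROLLARY A 
$\mathrm{Var}\left(a + \sum_{i=1}^{n} b_i X_i\right) = \sum_{i=1}^{n} \sum_{j=1}^{n} b_i b_j\, \mathrm{Cov}(X_i, X_j)$. ■

If the $X_i$ are independent, then $\mathrm{Cov}(X_i, X_j) = 0$ for $i \ne j$, and we have another 
corollary. 

COROLLARY B 
$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i)$, if the $X_i$ are independent. ■

Corollary B is very useful. Note that $E(\sum X_i) = \sum E(X_i)$ whether or not the 
$X_i$ are independent, but it is generally not the case that $\mathrm{Var}(\sum X_i) = \sum \mathrm{Var}(X_i)$. 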
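Corollary A, together with the warning that variances do not simply add for dependent variables, can be checked exactly on a small example. The joint frequency function and the coefficients below are hypothetical choices made for this sketch.

```python
from fractions import Fraction

# Hypothetical joint frequency function of (X1, X2); the two are dependent.
pmf = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 8),
       (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 4)}

def expect(g):
    """Expectation of g(x), where x = (x1, x2), under the joint pmf."""
    return sum(p * g(x) for x, p in pmf.items())

def cov(i, j):
    """Cov(X_i, X_j) computed as E(X_i X_j) - E(X_i) E(X_j)."""
    return (expect(lambda x: x[i] * x[j])
            - expect(lambda x: x[i]) * expect(lambda x: x[j]))

a = Fraction(5)
b = (Fraction(2), Fraction(-3))

def linear(x):
    """The linear combination a + b1*X1 + b2*X2."""
    return a + b[0] * x[0] + b[1] * x[1]

# Left side: the variance straight from the definition.
direct = expect(lambda x: linear(x) ** 2) - expect(linear) ** 2
# Right side: the double sum of Corollary A.
double_sum = sum(b[i] * b[j] * cov(i, j) for i in range(2) for j in range(2))
# Weighted sum of variances only, which drops the covariance terms.
variances_only = sum(b[i] ** 2 * cov(i, i) for i in range(2))

print(direct, double_sum, variances_only)
```

The first two values coincide, while the third differs because the covariance between the two variables is not zero.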


EXAMPLE B 
Finding the variance of a binomial random variable from the definition of variance and 
the frequency function of the binomial distribution is not easy (try it). But expressing 
a binomial random variable as a sum of independent Bernoulli random variables 
makes the computation of the variance trivial. Specifically, if $Y$ is a binomial random 
variable, it can be expressed as $Y = X_1 + X_2 + \cdots + X_n$, where the $X_i$ are independent 
Bernoulli random variables with $P(X_i = 1) = p$. We saw earlier (Example A in 
Section 4.2) that $\mathrm{Var}(X_i) = p(1 - p)$, from which it follows from Corollary B that 
$\mathrm{Var}(Y) = np(1 - p)$. ■
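For a specific $n$ and $p$, a computer can carry out the "not easy" calculation from the frequency function, which gives a check on the shortcut. The values of $n$ and $p$ below are arbitrary illustrative choices.

```python
from fractions import Fraction
from math import comb

def binomial_variance_from_definition(n, p):
    """Var(Y) computed directly from the binomial frequency function."""
    p = Fraction(p)
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    return sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

n, p = 12, Fraction(3, 10)   # arbitrary illustrative values
from_definition = binomial_variance_from_definition(n, p)
from_corollary = n * p * (1 - p)
print(from_definition, from_corollary)
```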


EXAMPLE C  Random Walk 
A drunken walker starts out at a point $x_0$ on the real line. He takes a step of length $X_1$, 
which is a random variable with expected value $\mu$ and variance $\sigma^2$, and his position 
at that time is $S(1) = x_0 + X_1$. He then takes another step of length $X_2$, which is 
independent of $X_1$ with the same mean and standard deviation. His position after $n$ 
such steps is $S(n) = x_0 + \sum_{i=1}^{n} X_i$. Then 

$$E(S(n)) = x_0 + E\left(\sum_{i=1}^{n} X_i\right) = x_0 + n\mu$$

$$\mathrm{Var}(S(n)) = \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = n\sigma^2$$

He thus expects to be at the position $x_0 + n\mu$ with an uncertainty as measured by the 
standard deviation of $\sqrt{n}\,\sigma$. Note that if $\mu > 0$, for example, for large values of $n$ 
he will be to the right of the point $x_0$ with very high probability (using Chebyshev’s 
inequality). 
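The mean and standard deviation of the walker's position can be compared with a simulation. This is a sketch under stated assumptions: the starting point, step mean, and step standard deviation are made-up values, and the steps are taken to be normal only for convenience, since the two formulas do not depend on the shape of the step distribution.

```python
import math
import random

def simulate_final_positions(x0, mu, sigma, n_steps, n_walks, seed=1):
    """Simulate S(n) = x0 + X_1 + ... + X_n for many independent walks."""
    rng = random.Random(seed)
    return [x0 + sum(rng.gauss(mu, sigma) for _ in range(n_steps))
            for _ in range(n_walks)]

x0, mu, sigma, n = 0.0, 0.5, 2.0, 100    # hypothetical values
finals = simulate_final_positions(x0, mu, sigma, n, n_walks=5000)

mean_hat = sum(finals) / len(finals)
sd_hat = math.sqrt(sum((s - mean_hat) ** 2 for s in finals) / len(finals))

print(mean_hat, x0 + n * mu)             # sample mean vs. x0 + n*mu
print(sd_hat, math.sqrt(n) * sigma)      # sample sd vs. sqrt(n)*sigma
```

The sample values fluctuate around the theoretical ones, and the agreement tightens as the number of simulated walks grows.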

Random walks have found applications in many areas of science. Brownian mo- 
tion is a continuous time version of a random walk with the steps being normally 
distributed random variables. The name derives from observations of the biologist 
Robert Brown in 1827 of the apparently spontaneous motion of pollen grains sus- 
pended in water. This was later explained by Einstein to be due to the collisions of 
the grains with randomly moving water molecules. 

The theory of Brownian motion was developed by Louis Bachelier in 1900 in his 
PhD thesis “The theory of speculation,” which related random walks to the evolution 
of stock prices. If the value of a stock evolves through time as a random walk, its 
short-term behavior is unpredictable. The efficient market hypothesis states that stock 
prices already reflect all known information so that the future price is random and 
unknowable. The solid line in Figure 4.5 shows the value of the S&P 500 during 
2003. The average of the increments (steps) was 0.81 and the standard deviation was 
9.82. The dashed lines are simulations of random walks with the same initial value 
and increments that were normally distributed random variables with $\mu = 0.81$ and 
$\sigma = 9.82$. Notice the long stretches of upturns and downturns that occurred in the 
random walks as the markets reacted in ways that would have been explained ex post 
facto by analysts. See Malkiel (2004) for a popular exposition of the implications of 
random walk theory for stock market investors. ■

FIGURE 4.5 The solid line is the value of the S&P 500 during 2003. The dashed 
lines are simulations of random walks. (Horizontal axis: Time; vertical axis: Value.) 


The correlation coefficient is defined in terms of the covariance. 


DEFINITION 


If $X$ and $Y$ are jointly distributed random variables and the variances and covariances 
of both $X$ and $Y$ exist and the variances are nonzero, then the correlation 
of $X$ and $Y$, denoted by $\rho$, is 

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$


Note that because of the way the ratio is formed, the correlation is a dimensionless 
quantity (it has no units, such as inches, since the units in the numerator and 
denominator cancel). From the properties of the variance and covariance that we have 
established, it follows easily that if $X$ and $Y$ are both subjected to linear transformations 
with positive slopes (such as changing their units from inches to meters), the correlation 
coefficient does not change. Since it does not depend on the units of measurement, $\rho$ is in many 
cases a more useful measure of association than is the covariance. 
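The invariance under a change of units can be seen with the sample analogue of the correlation. The paired measurements below are made up for illustration; one variable is converted from inches to meters and the other is passed through an arbitrary increasing linear map.

```python
import math

def correlation(xs, ys):
    """Sample analogue of rho: covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    return sxy / math.sqrt(sxx * syy)

# Made-up paired measurements, in inches.
x_in = [60.0, 62.5, 65.0, 67.0, 70.5, 72.0]
y_in = [61.0, 61.5, 66.0, 66.5, 69.0, 73.5]

x_m = [0.0254 * x for x in x_in]           # inches to meters
y_mapped = [3.0 + 2.0 * y for y in y_in]   # arbitrary map with positive slope
y_flipped = [-y for y in y_in]             # a negative slope, for contrast

r_original = correlation(x_in, y_in)
r_rescaled = correlation(x_m, y_mapped)
r_flipped = correlation(x_in, y_flipped)
print(r_original, r_rescaled, r_flipped)
```

The first two values agree up to rounding. The third has the same magnitude but the opposite sign, which is why the invariance is stated for scale factors whose product is positive.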


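EXAMPLE D 
Let us return to the bivariate uniform distribution of Example A. Because $X$ and $Y$ 
are marginally uniform, $\mathrm{Var}(X) = \mathrm{Var}(Y) = \frac{1}{12}$. In the one case ($\alpha = -1$), we found 
$\mathrm{Cov}(X, Y) = -\frac{1}{36}$, so 

$$\rho = -\frac{1}{36} \times 12 = -\frac{1}{3}$$

In the other case ($\alpha = 1$), the covariance was $\frac{1}{36}$, so the correlation is $\frac{1}{3}$. ■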


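The following notation and relationship are often useful. The standard deviations 
of $X$ and $Y$ are denoted by $\sigma_X$ and $\sigma_Y$ and their covariance by $\sigma_{XY}$. We thus have 

$$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

and 

$$\sigma_{XY} = \rho\,\sigma_X \sigma_Y$$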


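The following theorem states some further properties of $\rho$. 

THEOREM B 
$-1 \le \rho \le 1$. Furthermore, $\rho = \pm 1$ if and only if $P(Y = a + bX) = 1$ for some 
constants $a$ and $b$. 

Proof 
Since the variance of a random variable is nonnegative, 

$$\begin{aligned}
0 &\le \mathrm{Var}\left(\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right) \\
&= \mathrm{Var}\left(\frac{X}{\sigma_X}\right) + \mathrm{Var}\left(\frac{Y}{\sigma_Y}\right) + 2\,\mathrm{Cov}\left(\frac{X}{\sigma_X}, \frac{Y}{\sigma_Y}\right) \\
&= \frac{\mathrm{Var}(X)}{\sigma_X^2} + \frac{\mathrm{Var}(Y)}{\sigma_Y^2} + \frac{2\,\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \\
&= 2(1 + \rho)
\end{aligned}$$

From this, we see that $\rho \ge -1$. Similarly, 

$$0 \le \mathrm{Var}\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right) = 2(1 - \rho)$$

implies that $\rho \le 1$. Suppose that $\rho = 1$. Then 

$$\mathrm{Var}\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right) = 0$$

which by Corollary A of Section 4.2 implies that 

$$P\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y} = c\right) = 1$$

for some constant, $c$. This is equivalent to $P(Y = a + bX) = 1$ for some $a$ and 
$b$. A similar argument holds for $\rho = -1$. ■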


EXAMPLE E  Investment Portfolio 

We are now in a position to further develop the investment theory discussed in 
Section 4.1.2, Example E, and Section 4.2, Example D. Please review those examples 
before continuing. We first consider the simple example of two securities, assuming that 
they have the same expected returns $\mu_1 = \mu_2 = \mu$ and their returns are uncorrelated: 
$\sigma_{12} = \mathrm{Cov}(R_1, R_2) = 0$. For a portfolio $(\pi, 1 - \pi)$, the expected return is 

$$E(R(\pi)) = \pi\mu + (1 - \pi)\mu = \mu$$

so that when considering expected return only, the choice of $\pi$ makes no difference. 
However, taking risk into account, 

$$\mathrm{Var}(R(\pi)) = \pi^2\sigma_1^2 + (1 - \pi)^2\sigma_2^2$$

Minimizing this with respect to $\pi$ gives the optimal portfolio 

$$\pi_{\mathrm{opt}} = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}$$

For example, if the investments are equally risky, $\sigma_1 = \sigma_2 = \sigma$, then $\pi = 1/2$, so the 
best strategy is to split her total investment equally between the two securities. If she 
does so, the variance of her return is, by Theorem A, 

$$\mathrm{Var}(R(1/2)) = \frac{\sigma^2}{2}$$

whereas if she put all her money in one security, the variance of her return would 
be $\sigma^2$. The expected return is the same in both cases. This is a particularly simple 
example of the value of diversification of investments. 
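The closed-form minimizer can be checked against a brute-force search over portfolio weights. The standard deviations used below are hypothetical, and the function names are introduced only for this sketch.

```python
def two_asset_variance(pi, sigma1, sigma2):
    """Var(R(pi)) for two uncorrelated securities with weights (pi, 1 - pi)."""
    return pi**2 * sigma1**2 + (1 - pi) ** 2 * sigma2**2

def optimal_weight(sigma1, sigma2):
    """Minimizer of the variance: sigma2^2 / (sigma1^2 + sigma2^2)."""
    return sigma2**2 / (sigma1**2 + sigma2**2)

sigma1, sigma2 = 0.10, 0.20     # hypothetical standard deviations of the returns
pi_opt = optimal_weight(sigma1, sigma2)

# Crude grid search as an independent check on the closed form.
grid = [k / 1000 for k in range(1001)]
pi_grid = min(grid, key=lambda w: two_asset_variance(w, sigma1, sigma2))

print(pi_opt, pi_grid, two_asset_variance(pi_opt, sigma1, sigma2))
# Equal risks: half in each security, with half the variance of a single security.
print(optimal_weight(1.0, 1.0), two_asset_variance(0.5, 1.0, 1.0))
```

Most of the weight goes to the less risky security, but not all of it: holding some of the riskier, uncorrelated security still lowers the overall variance.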

Suppose now that the two securities do not have the same expected returns, 
$\mu_1 < \mu_2$. Let the standard deviations of the returns be $\sigma_1$ and $\sigma_2$; usually less risky 
investments have lower expected returns, $\sigma_1 < \sigma_2$. Furthermore, the two returns may 
be correlated: $\mathrm{Cov}(R_1, R_2) = \rho\sigma_1\sigma_2$. Corresponding to the portfolio $(\pi, 1 - \pi)$, we 
have expected return 

$$E(R(\pi)) = \pi\mu_1 + (1 - \pi)\mu_2$$

and the variance of the return is 

$$\mathrm{Var}(R(\pi)) = \pi^2\sigma_1^2 + 2\pi(1 - \pi)\rho\sigma_1\sigma_2 + (1 - \pi)^2\sigma_2^2$$

Comparing this to the result when the returns were independent, we see the risk is 
lower when the returns are independent than when they are positively correlated. It 
would thus be better to invest in two unrelated or weakly related market sectors than 
to make two investments in the same sector. In deciding the choice of the portfolio 
vector, the investor can study how the risk (the standard deviation of $R(\pi)$) changes 
as the expected return increases, and balance expected return versus risk. 

In actual investment decisions, many more than two possible investments are 
involved, but the basic idea remains the same. Suppose there are $n$ possible investments. 
Let the portfolio weights be denoted by the vector $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$. Let 
$E(R_i) = \mu_i$, $\mathrm{Cov}(R_i, R_j) = \sigma_{ij}$ (so, in particular, $\mathrm{Var}(R_i)$ is denoted by $\sigma_{ii}$), then 

$$E(R(\pi)) = \sum_{i=1}^{n} \pi_i \mu_i$$

and 

$$\mathrm{Var}(R(\pi)) = \sum_{i=1}^{n} \sum_{j=1}^{n} \pi_i \pi_j \sigma_{ij}$$
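The two sums translate directly into code. In the sketch below, the expected returns, the covariance matrix, and the weights are hypothetical numbers for a three-security portfolio, with `cov[i][j]` playing the role of $\sigma_{ij}$.

```python
def portfolio_mean(weights, means):
    """E(R(pi)) = sum over i of pi_i * mu_i."""
    return sum(w * m for w, m in zip(weights, means))

def portfolio_variance(weights, cov):
    """Var(R(pi)) = sum over i and j of pi_i * pi_j * sigma_ij."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))

# Hypothetical three-security example.
means = [0.05, 0.08, 0.12]
cov = [[0.010, 0.002, 0.001],
       [0.002, 0.040, 0.010],
       [0.001, 0.010, 0.090]]
weights = [0.5, 0.3, 0.2]

mean_return = portfolio_mean(weights, means)
risk = portfolio_variance(weights, cov) ** 0.5   # standard deviation of R(pi)
print(mean_return, risk)
```

Evaluating these two functions over many candidate weight vectors is one simple way to explore the trade-off between expected return and risk.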


The investment decision, the choice of the portfolio vector $\pi$, is often couched as 
that of maximizing expected return subject to the risk being less than some value the 
individual investor is willing to tolerate. Some investors are more risk averse than 
others, so the portfolio vectors will differ from investor to investor. Equivalently, the 
decision may be phrased as that of finding the portfolio vector with the minimum risk 
subject to a desired return; there may well be many portfolio choices that give the 
same expected return, and the wise investor would choose the one among them that 
had the lowest risk. 

As a general rule, risk is reduced by diversification and can be decreased with 
only a small sacrifice of returns. Figure 4.6 from Bernstein (1996, p. 254) illustrates 
this point empirically. The point labeled “Index” shows the monthly average versus 
standard deviation for an investment that was equally weighted across all the markets. 
A reasonably high return with relatively little risk would thus have been obtained by 
spreading investments equally over the 13 stock markets. In fact, the risk is less than 
that of any of the individual markets. Note that these emerging markets were riskier 
than the U.S. market, but that they were more profitable. ■

FIGURE 4.6 The benefit of diversification. The monthly average return from January 
1992 to June 1994 of 13 stock markets, plotted against their standard deviations. The 
performance of the Standard and Poor's 500 index of U.S. stocks is plotted for 
comparison. 


EXAMPLE F  Bivariate Normal Distribution 

We will show that the covariance of $X$ and $Y$ when they follow a bivariate normal 
distribution is $\rho\sigma_X\sigma_Y$, which means that $\rho$ is the correlation coefficient. The covariance is 

$$\mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) f(x, y)\, dx\, dy$$

Making the changes of variables $u = (x - \mu_X)/\sigma_X$ and $v = (y - \mu_Y)/\sigma_Y$ changes 
the right-hand side to 

$$\frac{\sigma_X\sigma_Y}{2\pi\sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} uv \exp\left[-\frac{1}{2(1 - \rho^2)}(u^2 + v^2 - 2\rho uv)\right] du\, dv$$

As in Example F in Section 3.3, we use the technique of completing the square to 
rewrite this expression as 

$$\frac{\sigma_X\sigma_Y}{2\pi\sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} v \exp(-v^2/2) \left(\int_{-\infty}^{\infty} u \exp\left[-\frac{1}{2(1 - \rho^2)}(u - \rho v)^2\right] du\right) dv$$

The inner integral is the mean of an $N[\rho v, (1 - \rho^2)]$ random variable, lacking only 
the normalizing constant $[2\pi(1 - \rho^2)]^{-1/2}$, and we thus have 

$$\mathrm{Cov}(X, Y) = \frac{\rho\sigma_X\sigma_Y}{\sqrt{2\pi}} \int_{-\infty}^{\infty} v^2 e^{-v^2/2}\, dv = \rho\sigma_X\sigma_Y$$

as was to be shown. ■


The correlation coefficient measures the strength of the linear relationship 
between X and Y (compare with Figure 3.9). Correlation also affects the appearance 
of a scatterplot, which is constructed by generating n independent pairs (X;, Y;), 
where i = 1, ..., n, and plotting the points. Figure 4.7 shows scatterplots of 100 
pairs of pseudorandom bivariate normal random variables for various values of $\rho$. 
Note that the clouds of points are roughly elliptical in shape. 
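Pairs like those in the scatterplots can be generated with a few lines of code. The sketch below uses the construction $Y = \rho Z_1 + \sqrt{1 - \rho^2}\, Z_2$ with independent standard normal $Z_1$ and $Z_2$, which is a common device and not necessarily how the figure was produced; a larger sample than 100 is used so that the sample correlation settles near $\rho$.

```python
import math
import random

def bivariate_normal_pairs(rho, n, seed=0):
    """Generate n standard bivariate normal pairs with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # X = Z1 and Y = rho*Z1 + sqrt(1 - rho^2)*Z2 have unit variances
        # and covariance rho.
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho**2) * z2))
    return pairs

def sample_correlation(pairs):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

results = {rho: sample_correlation(bivariate_normal_pairs(rho, n=20_000))
           for rho in (0.0, 0.3, 0.6, 0.9)}
for rho, r in results.items():
    print(rho, round(r, 3))
```

Plotting the pairs returned by `bivariate_normal_pairs` with any plotting tool gives point clouds of the roughly elliptical shape described above.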





























FIGURE 4.7 Scatterplots of 100 independent pairs of bivariate normal random 
variables, (a) $\rho = 0$, (b) $\rho = .3$, (c) $\rho = .6$, (d) $\rho = .9$. 

Conditional Expectation and Prediction 


Definitions and Examples 


In Section 3.5, conditional frequency functions and density functions were defined. 
We noted that these had the properties of ordinary frequency and density functions. In 
particular, associated with a conditional distribution is a conditional mean. Suppose 
that $Y$ and $X$ are discrete random variables and that the conditional frequency function 
of $Y$ given $x$ is $p_{Y|X}(y|x)$. The conditional expectation of $Y$ given $X = x$ is 

$$E(Y|X = x) = \sum_y y\, p_{Y|X}(y|x)$$

For the continuous case, we have 

$$E(Y|X = x) = \int y\, f_{Y|X}(y|x)\, dy$$

More generally, the conditional expectation of a function $h(Y)$ is 

$$E[h(Y)|X = x] = \int h(y)\, f_{Y|X}(y|x)\, dy$$

in the continuous case. A similar equation holds in the discrete case. 


EXAMPLE A 
Consider a Poisson process on $[0, 1]$ with mean $\lambda$, and let $N$ be the number of points 
in $[0, 1]$. For $p < 1$, let $X$ be the number of points in $[0, p]$. Find the conditional 
distribution and conditional mean of $X$ given $N = n$. 

We first find the joint distribution: $P(X = x, N = n)$, which is the probability 
of $x$ events in $[0, p]$ and $n - x$ events in $[p, 1]$. From the assumption of a Poisson 
process, the counts in the two intervals are independent Poisson random variables 
with parameters $p\lambda$ and $(1 - p)\lambda$, so 

$$p_{XN}(x, n) = \frac{(p\lambda)^x e^{-p\lambda}}{x!} \cdot \frac{[(1 - p)\lambda]^{n - x} e^{-(1 - p)\lambda}}{(n - x)!}$$

The marginal distribution of $N$ is Poisson, so the conditional frequency function of 
$X$ is, after some algebra, 

$$p_{X|N}(x|n) = \frac{p_{XN}(x, n)}{p_N(n)} = \binom{n}{x} p^x (1 - p)^{n - x}$$

This is the binomial distribution with parameters $n$ and $p$. The conditional expectation 
is thus, by Example A of Section 4.1.2, $np$. ■
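The "some algebra" can be checked numerically by forming the ratio of the joint frequency function to the marginal and comparing it with the binomial frequency function. The values of $\lambda$, $p$, and $n$ below are arbitrary illustrative choices.

```python
from math import comb, exp, factorial, isclose

def poisson_pmf(k, mean):
    """Poisson frequency function with the given mean."""
    return mean**k * exp(-mean) / factorial(k)

def conditional_pmf(x, n, p, lam):
    """P(X = x | N = n) as the ratio p_XN(x, n) / p_N(n)."""
    joint = poisson_pmf(x, p * lam) * poisson_pmf(n - x, (1 - p) * lam)
    return joint / poisson_pmf(n, lam)

def binomial_pmf(x, n, p):
    """Binomial frequency function with parameters n and p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

lam, p, n = 4.0, 0.3, 7     # arbitrary illustrative values
for x in range(n + 1):
    print(x, conditional_pmf(x, n, p, lam), binomial_pmf(x, n, p))

cond_mean = sum(x * conditional_pmf(x, n, p, lam) for x in range(n + 1))
print(cond_mean, n * p)     # conditional mean vs. n*p
```

Note that $\lambda$ cancels in the ratio, so the conditional distribution of $X$ given $N = n$ does not depend on the rate of the process.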
EXAMPLE B  Bivariate Normal Distribution 
From Example C in Section 3.5.2, if $Y$ and $X$ follow a bivariate normal distribution, 
the conditional density of $Y$ given $X$ is 

$$f_{Y|X}(y|x) = \frac{1}{\sigma_Y\sqrt{2\pi(1 - \rho^2)}} \exp\left(-\frac{1}{2}\, \frac{\left[y - \mu_Y - \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)\right]^2}{\sigma_Y^2(1 - \rho^2)}\right)$$

This is a normal density with mean $\mu_Y + \rho(x - \mu_X)\sigma_Y/\sigma_X$ and variance $\sigma_Y^2(1 - \rho^2)$. 
The former is the conditional mean and the latter the conditional variance of $Y$ given 
$X = x$. 

Note that the conditional mean is a linear function of $X$ and that as $|\rho|$ increases, 
the conditional variance decreases; both of these facts are suggested by the elliptical 
contours of the joint density. To see this more exactly, consider the case in which 
$\sigma_X = \sigma_Y = 1$ and $\mu_X = \mu_Y = 0$. The contours then are ellipses satisfying 

$$x^2 - 2\rho xy + y^2 = \text{constant}$$

The major and minor axes of such an ellipse are at 45° and 135°. The conditional 
expectation of $Y$ given $X = x$ is the line $y = \rho x$; note that this line does not lie along 
the major axis of the ellipse. Figure 4.8 shows such a bivariate normal distribution 
with $\rho = 0.5$. The curved lines of the bivariate density correspond to the conditional 
density of $Y$ given various values of $x$, but they are not normalized to integrate to 
1. The contours of the bivariate normal are the ellipses shown in the $xy$ plane as 
dashed curves, with the major axis shown by the straight dashed line. The conditional 
expectation of $Y$ given $X = x$ is shown as a function of $x$ by the solid line in the 
plane. Note that it is not the major axis of the ellipse. 

This phenomenon was noted by Sir Francis Galton (1822-1911), who studied 
the relationship of the heights of sons to those of their fathers. He observed that 
sons of very tall fathers were shorter on average than their fathers and that sons 
of very short fathers were on average taller. The empirical relationship is shown in 
Figure 14.19. ■

FIGURE 4.8 Bivariate normal density with correlation $\rho = 0.5$. The conditional 
expectation of $Y$ given $X = x$ is shown as the solid line in the $xy$ plane. 


Assuming that the conditional expectation of Y given X = x exists for every 
x in the range of X, it is a well-defined function of X and hence is a random 
variable, which we write as E(Y|X). For instance, in Example A we found that 
E(X|N =n) = np; thus, E(X|N) = Np is a random variable that is a function of 
N. Provided that the appropriate sums or integrals converge, this random variable has 
an expectation and a variance. Its expectation is E[E (Y | X)]; for this expression, note 
that since E(Y|X) is a random variable that is a function of X, the outer expectation 
can be taken with respect to the distribution of X (Theorem A of Section 4.1.1). The 
following theorem says that the average (expected) value of Y can be found by first 
conditioning on X, finding E (Y | X), and then averaging this quantity with respect to X. 


THEOREM A 
E(Y) = E[E(Y|X)]. 


Proof 


We will prove this for the discrete case. The continuous case is proved similarly. 
Using Theorem 4.1.1A we need to show that 

$$E(Y) = \sum_x E(Y|X = x)\, p_X(x)$$

where 

$$E(Y|X = x) = \sum_y y\, p_{Y|X}(y|x)$$

Interchanging the order of summation gives us 

$$\sum_x E(Y|X = x)\, p_X(x) = \sum_y y \sum_x p_{Y|X}(y|x)\, p_X(x)$$

(It can be shown that this interchange can be made.) From the law of total 
probability, we have 

$$p_Y(y) = \sum_x p_{Y|X}(y|x)\, p_X(x)$$

Therefore, 

$$\sum_y y \sum_x p_{Y|X}(y|x)\, p_X(x) = \sum_y y\, p_Y(y) = E(Y)$$
■


Theorem A gives what might be called a law of total expectation: The ex- 
pectation of a random variable Y can be calculated by weighting the conditional 
expectations appropriately and summing or integrating. 
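The weighting described here can be carried out exactly for a small discrete example. The joint frequency function below is hypothetical and chosen only to keep the arithmetic short.

```python
from fractions import Fraction

# Hypothetical joint frequency function p(x, y).
joint = {(0, 1): Fraction(1, 10), (0, 2): Fraction(3, 10),
         (1, 1): Fraction(2, 5), (1, 3): Fraction(1, 5)}

def marginal_x(x):
    """p_X(x), obtained by summing the joint frequency function over y."""
    return sum(p for (xx, _), p in joint.items() if xx == x)

def cond_mean_y(x):
    """E(Y | X = x) = sum over y of y * p(x, y) / p_X(x)."""
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / marginal_x(x)

x_values = sorted({x for x, _ in joint})

direct = sum(y * p for (_, y), p in joint.items())                    # E(Y)
iterated = sum(cond_mean_y(x) * marginal_x(x) for x in x_values)      # E[E(Y|X)]
print(direct, iterated)
```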


EXAMPLE C 
Suppose that in a system, a component and a backup unit both have mean lifetimes 
equal to $\mu$. If the component fails, the system automatically substitutes the backup 
unit, but there is probability $p$ that something will go wrong and it will fail to do so. 
Let $T$ be the total lifetime, and let $X = 1$ if the substitution of the backup takes place 
successfully, and $X = 0$ if it does not. Thus the total lifetime is the lifetime of the 
first component if the substitution fails and is the sum of the lifetimes of the original and 
the backup units if the substitution is made successfully. Then 

$$E(T|X = 1) = 2\mu$$
$$E(T|X = 0) = \mu$$

Thus, 

$$E(T) = E(T|X = 1)P(X = 1) + E(T|X = 0)P(X = 0) = 2\mu(1 - p) + \mu p = \mu(2 - p)$$
■

EXAMPLE D  Random Sums 
This example introduces sums of the type 

$$T = \sum_{i=1}^{N} X_i$$

where $N$ is a random variable with a finite expectation and the $X_i$ are random variables 
that are independent of $N$ and have the common mean $E(X)$. Such sums arise in a 
variety of applications. An insurance company might receive $N$ claims in a given 
period of time, and the amounts of the individual claims might be modeled as random 
variables $X_1, X_2, \ldots$. The random variable $N$ could denote the number of customers 
entering a store and $X_i$ the expenditure of the $i$th customer, or $N$ could denote the 
number of jobs in a single-server queue and $X_i$ the service time for the $i$th job. 
For this last case, $T$ is the time to serve all the jobs in the queue. According to 
Theorem A, 

$$E(T) = E[E(T|N)]$$

Since $E(T|N = n) = nE(X)$, $E(T|N) = NE(X)$ and thus 

$$E(T) = E[NE(X)] = E(N)E(X)$$

This agrees with the intuitive guess that the average time to complete $N$ jobs, where 
$N$ is random, is the average value of $N$ times the average amount of time to complete 
a job. ■


We have seen that the expectation of the random variable $E(Y|X)$ is $E(Y)$. We 
now find its variance. 

THEOREM B 
$\mathrm{Var}(Y) = \mathrm{Var}[E(Y|X)] + E[\mathrm{Var}(Y|X)]$. 

Proof 
We will explain what is meant by the notation in the course of the proof. First, 

$$\mathrm{Var}(Y|X = x) = E(Y^2|X = x) - [E(Y|X = x)]^2$$

which is defined for all values of $x$. Thus, just as we defined $E(Y|X)$ to be a 
random variable by letting $X$ be random, we can define $\mathrm{Var}(Y|X)$ as a random 
variable. In particular, $\mathrm{Var}(Y|X)$ has the expectation $E[\mathrm{Var}(Y|X)]$. Since 

$$\mathrm{Var}(Y|X) = E(Y^2|X) - [E(Y|X)]^2$$

$$E[\mathrm{Var}(Y|X)] = E[E(Y^2|X)] - E\{[E(Y|X)]^2\}$$

Also, 

$$\mathrm{Var}[E(Y|X)] = E\{[E(Y|X)]^2\} - \{E[E(Y|X)]\}^2$$

The final piece that we need is 

$$\begin{aligned}
\mathrm{Var}(Y) &= E(Y^2) - [E(Y)]^2 \\
&= E[E(Y^2|X)] - \{E[E(Y|X)]\}^2
\end{aligned}$$

by the law of total expectation. Now we can put all the pieces together: 

$$\begin{aligned}
\mathrm{Var}(Y) &= E[E(Y^2|X)] - \{E[E(Y|X)]\}^2 \\
&= E[E(Y^2|X)] - E\{[E(Y|X)]^2\} + E\{[E(Y|X)]^2\} - \{E[E(Y|X)]\}^2 \\
&= E[\mathrm{Var}(Y|X)] + \mathrm{Var}[E(Y|X)]
\end{aligned}$$
■
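The two pieces of the decomposition can be computed separately and added for a small discrete distribution. The joint frequency function below is a hypothetical example, and exact fractions are used so that the identity holds without rounding error.

```python
from fractions import Fraction

# Hypothetical joint frequency function p(x, y).
joint = {(0, 1): Fraction(1, 10), (0, 2): Fraction(3, 10),
         (1, 1): Fraction(2, 5), (1, 3): Fraction(1, 5)}

def p_x(x):
    """Marginal frequency function of X."""
    return sum(p for (xx, _), p in joint.items() if xx == x)

def cond_moment(x, k):
    """E(Y**k | X = x)."""
    return sum(y**k * p for (xx, y), p in joint.items() if xx == x) / p_x(x)

x_values = sorted({x for x, _ in joint})

e_y = sum(y * p for (_, y), p in joint.items())
var_y = sum(y**2 * p for (_, y), p in joint.items()) - e_y**2

# E[Var(Y|X)]: average the conditional variances over the distribution of X.
mean_of_cond_var = sum((cond_moment(x, 2) - cond_moment(x, 1) ** 2) * p_x(x)
                       for x in x_values)
# Var[E(Y|X)]: variance of the conditional mean, viewed as a function of X.
var_of_cond_mean = sum(cond_moment(x, 1) ** 2 * p_x(x) for x in x_values) - e_y**2

print(var_y, mean_of_cond_var, var_of_cond_mean)
```

In this example almost all of the variance of $Y$ comes from the first piece, because the two conditional means are nearly equal.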


EXAMPLE E  Random Sums 

Let us continue Example D but with the additional assumptions that the $X_i$ are independent 
random variables with the same mean, $E(X)$, and the same variance, $\mathrm{Var}(X)$, 
and that $\mathrm{Var}(N) < \infty$. According to Theorem B, 

$$\mathrm{Var}(T) = E[\mathrm{Var}(T|N)] + \mathrm{Var}[E(T|N)]$$

Because $E(T|N) = NE(X)$, 

$$\mathrm{Var}[E(T|N)] = [E(X)]^2\, \mathrm{Var}(N)$$

Also, since $\mathrm{Var}(T|N = n) = \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = n\, \mathrm{Var}(X)$, 

$$\mathrm{Var}(T|N) = N\, \mathrm{Var}(X)$$

and 

$$E[\mathrm{Var}(T|N)] = E(N)\, \mathrm{Var}(X)$$

We thus have 

$$\mathrm{Var}(T) = [E(X)]^2\, \mathrm{Var}(N) + E(N)\, \mathrm{Var}(X)$$

If $N$ is fixed, say, $N = n$, then $\mathrm{Var}(T) = n\, \mathrm{Var}(X)$. Thus, we see from the preceding 
equation that extra variability occurs in $T$ because $N$ is random. 

As a concrete example, suppose that the number of insurance claims in a certain 
time period has expected value equal to 900 and standard deviation equal to 30, as 
would be the case if the number were a Poisson random variable with expected value 
900. Suppose that the average claim value is $1000 and the standard deviation is $500. 
Then the expected value of the total, $T$, of the claims is $E(T) = \$900{,}000$ and the 
variance of $T$ is 

$$\mathrm{Var}(T) = 1000^2 \times 900 + 900 \times 500^2 = 1.125 \times 10^9$$

The standard deviation of $T$ is the square root of the variance, $33,541. The insurance 
company could then plan on total claims of $900,000 plus or minus a few standard 
deviations (by Chebyshev's inequality). Observe that if the total number of claims 
were not variable but were fixed at $N = 900$, the variance of the total claims would 
be given by $E(N)\mathrm{Var}(X)$ in the preceding expression. The result would be a standard 
deviation equal to $15,000. The variability in the number of claims thus contributes 
substantially to the uncertainty in the total. ■
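The insurance calculation is easy to package as a small function. The function name and argument names below are introduced for this sketch; the numbers are the ones used in the example.

```python
import math

def random_sum_moments(mean_n, var_n, mean_x, var_x):
    """Return (E(T), Var(T)) for T = X_1 + ... + X_N, with N independent of the X_i."""
    mean_t = mean_n * mean_x                      # E(N) E(X)
    var_t = mean_x**2 * var_n + mean_n * var_x    # [E(X)]^2 Var(N) + E(N) Var(X)
    return mean_t, var_t

# Random number of claims: mean 900, standard deviation 30.
mean_t, var_t = random_sum_moments(mean_n=900, var_n=30**2,
                                   mean_x=1000, var_x=500**2)
sd_t = math.sqrt(var_t)

# Number of claims held fixed at 900, so Var(N) = 0.
_, var_fixed = random_sum_moments(mean_n=900, var_n=0,
                                  mean_x=1000, var_x=500**2)
sd_fixed = math.sqrt(var_fixed)

print(mean_t, var_t, sd_t, sd_fixed)
```

Comparing `sd_t` with `sd_fixed` shows how much of the uncertainty in the total is due to not knowing the number of claims in advance.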


Prediction 


This section treats the problem of predicting the value of one random variable from 
another. We might wish, for example, to measure the value of some physical quan- 
tity, such as pressure, using an instrument. The actual pressures to be measured are 
unknown and variable, so we might model them as values of a random variable, Y. 
Assume that measurements are to be taken by some instrument that produces a re- 
sponse, X, related to Y in some fashion but corrupted by random noise as well; X 
might represent current flow, for example. Y and X have some joint distribution, and 
we wish to predict the actual pressure, Y , from the instrument response, X. 

As another example, in forestry, the volume of a tree is sometimes estimated from 
its diameter, which is easily measured. For a whole forest, it is reasonable to model 
diameter (X) and volume (Y) as random variables with some joint distribution, and 
then attempt to predict Y from X. 

Let us first consider a relatively trivial situation: the problem of predicting Y by 
means of a constant value, c. If we wish to choose the “best” value of c, we need some 
measure of the effectiveness of a prediction. One that is amenable to mathematical 
analysis and that is widely used is the mean squared error: 


$$\mathrm{MSE} = E[(Y - c)^2]$$

This is the average squared error of prediction, the averaging being done with respect 
to the distribution of $Y$. The problem then becomes finding the value of $c$ that minimizes 
the mean squared error. To solve this problem, we denote $E(Y)$ by $\mu$ and 
observe that (see Theorem A of Section 4.2.1) 

$$\begin{aligned}
E[(Y - c)^2] &= \mathrm{Var}(Y - c) + [E(Y - c)]^2 \\
&= \mathrm{Var}(Y) + (\mu - c)^2
\end{aligned}$$

The first term of the last expression does not depend on $c$, and the second term is 
minimized for $c = \mu$, which is the optimal choice of $c$. 
Now let us consider predicting $Y$ by some function $h(X)$ in order to minimize 
$\mathrm{MSE} = E\{[Y - h(X)]^2\}$. From Theorem A of Section 4.4.1, the right-hand side can 
be expressed as 

$$E\{[Y - h(X)]^2\} = E\left(E\{[Y - h(X)]^2 \mid X\}\right)$$

The outer expectation is with respect to $X$. For every $x$, the inner expectation is 
minimized by setting $h(x)$ equal to the constant $E(Y|X = x)$, from the result of the 
preceding paragraph. We thus have that the minimizing function $h(X)$ is 

$$h(X) = E(Y|X)$$

EXAMPLE A 
For the bivariate normal distribution, we found that 

$$E(Y|X) = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X)$$

This linear function of $X$ is thus the minimum mean squared error predictor of $Y$ 
from $X$. ■


A practical limitation of the optimal prediction scheme is that its implementation 
depends on knowing the joint distribution of Y and X in order to find E(Y|X), and 
often this information is not available, not even approximately. For this reason, we 
can try to attain the more modest goal of finding the optimal linear predictor of Y . (In 
Example A, it turned out that the best predictor was linear, but this is not generally 
the case.) That is, rather than finding the best function $h$ among all functions, we try 
to find the best function of the form $h(x) = \alpha + \beta x$. This merely requires optimizing 
over the two parameters $\alpha$ and $\beta$. Now 

$$\begin{aligned}
E[(Y - \alpha - \beta X)^2] &= \mathrm{Var}(Y - \alpha - \beta X) + [E(Y - \alpha - \beta X)]^2 \\
&= \mathrm{Var}(Y - \beta X) + [E(Y - \alpha - \beta X)]^2
\end{aligned}$$

The first term of the last expression does not depend on $\alpha$, so $\alpha$ can be chosen so as 
to minimize the second term. To do this, note that 

$$E(Y - \alpha - \beta X) = \mu_Y - \alpha - \beta\mu_X$$

and that the right-hand side is zero, and hence its square is minimized, if 

$$\alpha = \mu_Y - \beta\mu_X$$

As for the first term, 

$$\mathrm{Var}(Y - \beta X) = \sigma_Y^2 + \beta^2\sigma_X^2 - 2\beta\sigma_{XY}$$

where $\sigma_{XY} = \mathrm{Cov}(X, Y)$. This is a quadratic function of $\beta$, and the minimum is 
found by setting the derivative with respect to $\beta$ equal to zero, which yields 

$$\beta = \frac{\sigma_{XY}}{\sigma_X^2} = \rho\frac{\sigma_Y}{\sigma_X}$$

where $\rho$ is the correlation coefficient. Substituting in these values of $\alpha$ and $\beta$, we find that 
the minimum mean squared error linear predictor, which we denote by $\hat{Y}$, is 

$$\begin{aligned}
\hat{Y} &= \alpha + \beta X \\
&= \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)
\end{aligned}$$

The mean squared prediction error is then 

$$\begin{aligned}
\mathrm{Var}(Y - \beta X) &= \sigma_Y^2 + \frac{\sigma_{XY}^2}{\sigma_X^4}\sigma_X^2 - 2\frac{\sigma_{XY}}{\sigma_X^2}\sigma_{XY} \\
&= \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2} \\
&= \sigma_Y^2 - \rho^2\sigma_Y^2 \\
&= \sigma_Y^2(1 - \rho^2)
\end{aligned}$$

First, note that the optimal linear predictor depends on the joint distribution of X
and Y only through their means, variances, and covariance. Thus, in practice, it is
generally easier to construct the optimal linear predictor or an approximation to it
than to construct the general optimal predictor $E(Y \mid X)$. Second, note that the form
of the optimal linear predictor is the same as that of $E(Y \mid X)$ for the bivariate normal
distribution. Third, note that the mean squared prediction error depends only on $\sigma_Y$
and $\rho$ and that it is small if $\rho$ is close to +1 or −1. Here we see again, from a different
point of view, that the correlation coefficient is a measure of the strength of the linear
relationship between X and Y.
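
The formulas for the optimal linear predictor can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function names, the sample size, the seed, and the choice of standard normal X with correlation 0.6 are all hypothetical and not part of the text. It estimates the slope and intercept from sample moments and compares the resulting mean squared error with $\sigma_Y^2(1 - \rho^2)$.

```python
import random
import statistics


def best_linear_predictor(xs, ys):
    """Return (alpha, beta) built from sample means, variance, and covariance."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(xs)
    sxx = sum((x - mx) ** 2 for x in xs) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    beta = sxy / sxx            # beta = sigma_XY / sigma_X^2
    alpha = my - beta * mx      # alpha = mu_Y - beta * mu_X
    return alpha, beta


def simulate(n=50_000, rho=0.6, seed=0):
    """Simulate (X, Y) with unit variances and correlation rho (an assumed model)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # Y = rho*X + sqrt(1 - rho^2)*noise has Var(Y) = 1 and Corr(X, Y) = rho.
        y = rho * x + (1.0 - rho**2) ** 0.5 * rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    alpha, beta = best_linear_predictor(xs, ys)
    errors = [(y - alpha - beta * x) ** 2 for x, y in zip(xs, ys)]
    return alpha, beta, statistics.fmean(errors)


alpha, beta, mse = simulate()
print(alpha, beta, mse)  # theory: alpha near 0, beta near 0.6, mse near 0.64
```

With unit variances the theoretical mean squared error is $1 - 0.6^2 = 0.64$; the simulated value should be close to this, up to sampling error.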


Suppose that two examinations are given in a course. As a probability model, we 
regard the scores of a student on the first and second examinations as jointly distributed 
random variables X and Y. Suppose for simplicity that the exams are scaled to have
the same means $\mu = \mu_X = \mu_Y$ and standard deviations $\sigma = \sigma_X = \sigma_Y$. Then,
the correlation between X and Y is $\rho = \sigma_{XY}/\sigma^2$ and the best linear predictor is
$\hat{Y} = \mu + \rho(X - \mu)$, so

$$\hat{Y} - \mu = \rho(X - \mu)$$

Notice that by this equation we predict the student's score on the second examination
to differ from the overall mean $\mu$ by less than did the score on the first examination.
If the correlation $\rho$ is positive, this is encouraging for a student who scores below the
mean on the first exam, since our best prediction is that his score on the next exam
will be closer to the mean. On the other hand, it's bad news for the student who scored
above the mean on the first exam, since our best prediction is that she will score closer
to the mean on the next exam. This phenomenon is often referred to as regression to
the mean. ■
4.5 The Moment-Generating Function


This section develops and applies some of the properties of the moment-generating 
function. It turns out, despite its unlikely appearance, to be a very useful tool that can 
dramatically simplify certain calculations. 

The moment-generating function (mgf) of a random variable X is $M(t) = E(e^{tX})$
if the expectation is defined. In the discrete case,

$$M(t) = \sum_x e^{tx} p(x)$$

and in the continuous case,

$$M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$

The expectation, and hence the moment-generating function, may or may not exist
for any particular value of t. In the continuous case, the existence of the expectation
depends on how rapidly the tails of the density decrease; for example, because the
tails of the Cauchy density die down at the rate $x^{-2}$, the expectation does not exist
for any $t \ne 0$ and the moment-generating function is undefined. The tails of the normal
density die down at the rate $e^{-x^2}$, so the integral converges for all t.


PROPERTY A 


If the moment-generating function exists for t in an open interval containing
zero, it uniquely determines the probability distribution. ■


We cannot prove this important property here—its proof depends on properties 
of the Laplace transform. Note that Property A says that if two random variables have 
the same mgf in an open interval containing zero, they have the same distribution. 
For some problems, we can find the mgf and then deduce the unique probability 
distribution corresponding to it. 

The rth moment of a random variable is $E(X^r)$ if the expectation exists. We
have already encountered the first and second moments earlier in this chapter, that is,
$E(X)$ and $E(X^2)$. Central moments rather than ordinary moments are often used: The
rth central moment is $E\{[X - E(X)]^r\}$. The variance is the second central moment
and is a measure of dispersion about the mean. The third central moment, called the 
skewness, is used as a measure of the asymmetry of a density or a frequency function 
about its mean; if a density is symmetric about its mean, the skewness is zero (see 
Problem 78 at the end of this chapter). As its name implies, the moment-generating 
function has something to do with moments. To see this, consider the continuous 
case: 


$$M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$

The derivative of $M(t)$ is

$$M'(t) = \frac{d}{dt}\int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$

It can be shown that differentiation and integration can be interchanged, so that

$$M'(t) = \int_{-\infty}^{\infty} x e^{tx} f(x)\,dx$$

and

$$M'(0) = \int_{-\infty}^{\infty} x f(x)\,dx = E(X)$$

Differentiating r times, we find

$$M^{(r)}(0) = E(X^r)$$


It can further be argued that if the moment-generating function exists in an interval 
containing zero, then so do all the moments. We thus have the following property. 


PROPERTY B 


If the moment-generating function exists in an open interval containing zero,
then $M^{(r)}(0) = E(X^r)$. ■


To find the moments of a random variable from the definition of expectation, we 
must sum a series or carry out an integration. The utility of Property B is that, if the 
mgf can be found, the process of integration or summation, which may be difficult, can
be replaced by the process of differentiation, which is mechanical. We now illustrate 
these concepts using some familiar distributions. 


EXAMPLE A Poisson Distribution 
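By definition,

$$M(t) = \sum_{k=0}^{\infty} e^{tk}\,\frac{\lambda^k}{k!}\,e^{-\lambda}$$
$$= e^{-\lambda}\sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!}$$
$$= e^{-\lambda} e^{\lambda e^t}$$
$$= e^{\lambda(e^t - 1)}$$

The sum converges for all t. Differentiating, we have

$$M'(t) = \lambda e^t e^{\lambda(e^t - 1)}$$
$$M''(t) = \lambda e^t e^{\lambda(e^t - 1)} + \lambda^2 e^{2t} e^{\lambda(e^t - 1)}$$

Evaluating these derivatives at t = 0, we find

$$E(X) = \lambda$$
$$E(X^2) = \lambda^2 + \lambda$$

from which it follows that

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \lambda$$

We have found that the mean and the variance of a Poisson distribution are
equal. ■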
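
As an informal check of Property B for the Poisson case, the derivatives of the mgf at zero can be approximated by finite differences. The Python sketch below is illustrative only: the helper names, the step size `h`, and the value 3 for the Poisson parameter are assumptions chosen for the example.

```python
import math


def poisson_mgf(t, lam):
    """Poisson mgf: M(t) = exp(lam * (e^t - 1))."""
    return math.exp(lam * (math.exp(t) - 1.0))


def first_two_moments(mgf, h=1e-4):
    """Approximate M'(0) and M''(0) by central differences with step h."""
    m1 = (mgf(h) - mgf(-h)) / (2.0 * h)
    m2 = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / h**2
    return m1, m2


lam = 3.0  # hypothetical parameter value
m1, m2 = first_two_moments(lambda t: poisson_mgf(t, lam))
variance = m2 - m1**2
print(m1, m2, variance)  # theory: lam, lam**2 + lam, lam
```

The numerical values should agree with $\lambda$, $\lambda^2 + \lambda$, and $\lambda$ up to the error of the finite-difference approximation.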


EXAMPLE B Gamma Distribution 
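The mgf of a gamma distribution is

$$M(t) = \int_0^{\infty} e^{tx}\,\frac{\lambda^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1} e^{-\lambda x}\,dx$$
$$= \frac{\lambda^\alpha}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha-1} e^{x(t-\lambda)}\,dx$$

The latter integral converges for $t < \lambda$ and can be evaluated by relating it to the
gamma density having parameters $\alpha$ and $\lambda - t$. We thus obtain

$$M(t) = \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha)}{(\lambda - t)^\alpha} = \left(\frac{\lambda}{\lambda - t}\right)^{\alpha}$$

Differentiating, we find

$$M'(0) = E(X) = \frac{\alpha}{\lambda}$$
$$M''(0) = E(X^2) = \frac{\alpha(\alpha + 1)}{\lambda^2}$$

From these equations, we find that

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \frac{\alpha(\alpha + 1)}{\lambda^2} - \frac{\alpha^2}{\lambda^2} = \frac{\alpha}{\lambda^2}$$
■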


EXAMPLE C Standard Normal Distribution 
For the standard normal distribution, we have

$$M(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx} e^{-x^2/2}\,dx$$


158 . Chapter4 Expected Values 


The integral converges for all t and can be evaluated using the technique of completing
the square. Since

$$\frac{x^2}{2} - tx = \frac{1}{2}(x^2 - 2tx + t^2) - \frac{t^2}{2} = \frac{1}{2}(x - t)^2 - \frac{t^2}{2}$$

therefore,

$$M(t) = \frac{e^{t^2/2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x-t)^2/2}\,dx$$

Making the change of variables $u = x - t$ and using the fact that the standard normal
density integrates to 1, we find that

$$M(t) = e^{t^2/2}$$

From this result, we easily see that $E(X) = 0$ and $\mathrm{Var}(X) = 1$. ■


Let us continue with the development of the properties of the moment-generating 
function. 


PROPERTY C

If X has the mgf $M_X(t)$ and $Y = a + bX$, then Y has the mgf $M_Y(t) = e^{at} M_X(bt)$.

Proof

$$M_Y(t) = E(e^{tY})$$
$$= E(e^{at + btX})$$
$$= E(e^{at} e^{btX})$$
$$= e^{at} E(e^{btX})$$
$$= e^{at} M_X(bt)$$
■


EXAMPLE D General Normal Distribution 
If Y follows a general normal distribution with parameters $\mu$ and $\sigma$, the distribution
of Y is the same as that of $\mu + \sigma X$, where X follows a standard normal distribution.
Thus, from Example C and Property C,

$$M_Y(t) = e^{\mu t} M_X(\sigma t) = e^{\mu t} e^{\sigma^2 t^2/2}$$
■


PROPERTY D

If X and Y are independent random variables with mgf's $M_X$ and $M_Y$ and $Z = X + Y$,
then $M_Z(t) = M_X(t) M_Y(t)$ on the common interval where both mgf's exist.

Proof

$$M_Z(t) = E(e^{tZ})$$
$$= E(e^{tX + tY})$$
$$= E(e^{tX} e^{tY})$$

From the assumption of independence,

$$M_Z(t) = E(e^{tX}) E(e^{tY})$$
$$= M_X(t) M_Y(t)$$
■

By induction, Property D can be extended to sums of several independent random
variables. This is one of the most useful properties of the moment-generating function.
The next three examples show how it can be used to easily derive results that would
take a lot more work to achieve without recourse to the mgf.

EXAMPLE E

The sum of independent Poisson random variables is a Poisson random variable: If
X is Poisson with parameter $\lambda$ and Y is Poisson with parameter $\mu$, then X + Y is
Poisson with parameter $\lambda + \mu$, since

$$e^{\lambda(e^t - 1)} e^{\mu(e^t - 1)} = e^{(\lambda + \mu)(e^t - 1)}$$
■

EXAMPLE F

If X follows a gamma distribution with parameters $\alpha_1$ and $\lambda$ and, independent of X,
Y follows a gamma distribution with parameters $\alpha_2$ and $\lambda$, the mgf of X + Y is

$$\left(\frac{\lambda}{\lambda - t}\right)^{\alpha_1} \left(\frac{\lambda}{\lambda - t}\right)^{\alpha_2} = \left(\frac{\lambda}{\lambda - t}\right)^{\alpha_1 + \alpha_2}$$

where $t < \lambda$. The right-hand expression is the mgf of a gamma distribution with
parameters $\alpha_1 + \alpha_2$ and $\lambda$. It follows from this that the sum of n independent exponential
random variables with parameter $\lambda$ follows a gamma distribution with parameters n
and $\lambda$. Thus, the time between n consecutive events of a Poisson process in time follows
a gamma distribution. Assuming that the service times in a queue are independent
exponential random variables, the length of time to serve n customers follows a
gamma distribution. ■
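
The claim that a sum of n independent exponential random variables with parameter $\lambda$ is gamma with parameters n and $\lambda$ can be examined by simulation. The sketch below is a rough check under assumed values (n = 5, $\lambda$ = 2, a fixed seed, and hypothetical function names); it compares the simulated mean and variance with the gamma values $n/\lambda$ and $n/\lambda^2$ obtained from the mgf.

```python
import random
import statistics


def sum_of_exponentials(n, lam, rng):
    """One draw of X_1 + ... + X_n with X_i independent exponential(lam)."""
    return sum(rng.expovariate(lam) for _ in range(n))


def simulate_gamma_sum(n=5, lam=2.0, reps=40_000, seed=1):
    """Return the sample mean and variance of repeated sums."""
    rng = random.Random(seed)
    draws = [sum_of_exponentials(n, lam, rng) for _ in range(reps)]
    return statistics.fmean(draws), statistics.pvariance(draws)


sim_mean, sim_var = simulate_gamma_sum()
print(sim_mean, sim_var)  # gamma(5, 2) theory: mean 5/2 = 2.5, variance 5/4 = 1.25
```

Agreement of the first two moments does not by itself establish the distribution; it is only a sanity check on the mgf argument.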
EXAMPLE G

If $X \sim N(\mu, \sigma^2)$ and, independent of X, $Y \sim N(\nu, \tau^2)$, then the mgf of X + Y is

$$e^{\mu t} e^{t^2\sigma^2/2}\, e^{\nu t} e^{t^2\tau^2/2} = e^{(\mu + \nu)t}\, e^{t^2(\sigma^2 + \tau^2)/2}$$

which is the mgf of a normal distribution with mean $\mu + \nu$ and variance $\sigma^2 + \tau^2$. The
sum of independent normal random variables is thus normal. ■


The preceding three examples are atypical. In general, if two independent random
variables follow some type of distribution, it is not necessarily true that their
sum follows the same type of distribution. For example, the sum of two gamma random
variables having different values for the parameter $\lambda$ does not follow a gamma
distribution, as can be easily seen from the mgf.

We now apply moment-generating functions to random sums of the type introduced
in Section 4.4.1. Suppose that

$$S = \sum_{i=1}^{N} X_i$$

where the $X_i$ are independent and have the same mgf, $M_X$, and where N has the mgf
$M_N$ and is independent of the $X_i$. By conditioning, we have

$$M_S(t) = E[E(e^{tS} \mid N)]$$

Given N = n, $E(e^{tS} \mid N = n) = [M_X(t)]^n$ from Property D. We thus have

$$M_S(t) = E[M_X(t)^N]$$
$$= E(e^{N \log M_X(t)})$$
$$= M_N[\log M_X(t)]$$

(We must carefully note the values of t for which this is defined.)


EXAMPLE H Compound Poisson Distribution

This example presents a model that occurs for certain chain reactions, or “cascade” 
processes. When a single primary electron, having been accelerated in an electrical 
field, hits a plate, several secondary electrons are produced. In a multistage multiplying 
tube, each of these secondary electrons hits another plate and thereby produces a 
number of tertiary electrons. The process can continue through several stages in this 
manner. Woodward (1948) considered models of this type in which the number of 
electrons produced by the impact of a single electron on the plate is random and, in 
particular, in which the number of secondary electrons follows a Poisson distribution. 
The number of electrons produced at the third stage is described by a random sum 
of the type just described, where N is the number of secondary electrons and $X_i$ is
the number of electrons produced by the ith secondary electron. Suppose that the $X_i$
are independent Poisson random variables with parameter $\lambda$ and that N is a Poisson
random variable with parameter $\mu$. According to the preceding result, the mgf of S,
the total number of particles, is

$$M_S(t) = \exp[\mu(e^{\lambda(e^t - 1)} - 1)]$$
■


Example H illustrates the utility of the mgf. It would have been more difficult 
to find the probability mass function of the number of particles at the third stage. By 
differentiating the mgf, we can find the moments of the probability mass function 
(see Problem 98 at the end of this chapter). 
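
The compound Poisson mgf can be compared with a simulation of the two-stage process. The Python sketch below is a hedged illustration: the Poisson sampler (Knuth's multiplication method, adequate for small means), the parameter values $\mu = 3$ and $\lambda = 2$, the evaluation point t = 0.1, the seed, and all function names are assumptions made for the example.

```python
import math
import random
import statistics


def poisson_draw(mean, rng):
    """Sample a Poisson variate by Knuth's multiplication method (small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def third_stage_count(mu, lam, rng):
    """S = X_1 + ... + X_N with N ~ Poisson(mu) and X_i ~ Poisson(lam)."""
    n_secondary = poisson_draw(mu, rng)
    return sum(poisson_draw(lam, rng) for _ in range(n_secondary))


def compound_mgf(t, mu, lam):
    """M_S(t) = exp[mu * (exp(lam * (e^t - 1)) - 1)]."""
    return math.exp(mu * (math.exp(lam * (math.exp(t) - 1.0)) - 1.0))


mu, lam, t = 3.0, 2.0, 0.1  # hypothetical values
rng = random.Random(2)
draws = [third_stage_count(mu, lam, rng) for _ in range(20_000)]
sim_mean = statistics.fmean(draws)
sim_mgf = statistics.fmean([math.exp(t * s) for s in draws])
print(sim_mean, sim_mgf, compound_mgf(t, mu, lam))  # E(S) = mu * lam = 6 in theory
```

The sample average of $e^{tS}$ should be close to $M_S(t)$, and the sample mean of S close to $M_S'(0) = \mu\lambda$.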

If X and Y have a joint distribution, their joint moment-generating function is 
defined as 


$$M_{XY}(s, t) = E(e^{sX + tY})$$


which is a function of two variables, s and t. If the joint mgf is defined on an open 
set containing the origin, it uniquely determines the joint distribution. The mgf of the 
marginal distribution of X alone is 


Mx(s) = Mxy(s, 0) 


and similarly for Y. It can be shown that X and Y are independent if and only if 
their joint mgf factors into the product of the mgf’s of the marginal distributions. 
E(XY) and other higher-order joint moments can be obtained from the joint mgf 
by differentiation. Analogous properties hold for the joint mgf of several random 
variables. 

The major limitation of the mgf is that it may not exist. The characteristic 
function of a random variable X is defined to be 


$$\phi(t) = E(e^{itX})$$

where $i = \sqrt{-1}$. In the continuous case,

$$\phi(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\,dx$$

This integral converges for all values of t, since $|e^{itx}| \le 1$. The characteristic function
is thus defined for all distributions. Its properties are similar to those of the
mgf: Moments can be obtained by differentiation, the characteristic function changes
simply under linear transformations, and the characteristic function of a sum of independent
random variables is the product of their characteristic functions. But using
the characteristic function requires some familiarity with the techniques of complex
variables.


Approximate Methods 


In many applications, only the first two moments of a random variable, and not 
the entire probability distribution, are known, and even these may be known only 
approximately. We will see in Chapter 5 that repeated independent observations of a 
random variable allow reliable estimates to be made of its mean and variance. Suppose 
that we know the expectation and the variance of a random variable X but not the 
entire distribution, and that we are interested in the mean and variance of $Y = g(X)$
for some fixed function g. For example, we might be able to measure X and determine
its mean and variance, but really be interested in Y, which is related to X in a known
way. We might want to know Var(Y), at least approximately, in order to assess the
accuracy of the indirect measurement process. From the results given in this chapter,
we cannot in general find $E(Y) = \mu_Y$ and $\mathrm{Var}(Y) = \sigma_Y^2$ from $E(X) = \mu_X$ and
$\mathrm{Var}(X) = \sigma_X^2$, unless the function g is linear. However, if g is nearly linear in a range
in which X has high probability, it can be approximated by a linear function and
approximate moments of Y can be found.

In proceeding as just described, we follow a tack often taken in applied mathematics:
When confronted with a nonlinear problem that we cannot solve, we linearize.
In probability and statistics, this method is called propagation of error, or the
$\delta$ method. Linearization is carried out through a Taylor series expansion of g about
$\mu_X$. To the first order,

$$Y = g(X) \approx g(\mu_X) + (X - \mu_X)\,g'(\mu_X)$$

We have expressed Y as approximately equal to a linear function of X. Recalling that
if $U = a + bV$, then $E(U) = a + bE(V)$ and $\mathrm{Var}(U) = b^2\,\mathrm{Var}(V)$, we find

$$\mu_Y \approx g(\mu_X)$$

$$\sigma_Y^2 \approx \sigma_X^2\,[g'(\mu_X)]^2$$

We know that in general $E(Y) \ne g(E(X))$, as given by the approximation. In fact,
we can carry out the Taylor series expansion to the second order to get an improved
approximation of $\mu_Y$:

$$Y = g(X) \approx g(\mu_X) + (X - \mu_X)\,g'(\mu_X) + \tfrac{1}{2}(X - \mu_X)^2\,g''(\mu_X)$$

Taking the expectation of the right-hand side, we have, since $E(X - \mu_X) = 0$,

$$E(Y) \approx g(\mu_X) + \tfrac{1}{2}\sigma_X^2\,g''(\mu_X)$$

How good such approximations are depends on how nonlinear g is in a neighborhood
of $\mu_X$ and on the size of $\sigma_X$. From Chebyshev's inequality, we know that X is
unlikely to be many standard deviations away from $\mu_X$; if g can be reasonably well
approximated in this range by a linear function, the approximations for the moments
will be reasonable as well.


EXAMPLE A

The relation of voltage, current, and resistance is $V = IR$. Suppose that the voltage
is held constant at a value $V_0$ across a medium whose resistance fluctuates randomly
as a result, say, of random fluctuations at the molecular level. The current therefore
also varies randomly. Suppose that it can be determined experimentally to have mean
$\mu_I \ne 0$ and variance $\sigma_I^2$. We wish to find the mean and variance of the resistance, R,
and since we do not know the distribution of I, we must resort to an approximation.
We have

$$R = g(I) = \frac{V_0}{I}$$
$$g'(\mu_I) = -\frac{V_0}{\mu_I^2} \qquad g''(\mu_I) = \frac{2V_0}{\mu_I^3}$$

Thus,

$$\mu_R \approx \frac{V_0}{\mu_I} + \frac{V_0}{\mu_I^3}\,\sigma_I^2$$
$$\sigma_R^2 \approx \frac{V_0^2}{\mu_I^4}\,\sigma_I^2$$

We see that the variability of R depends on both the mean level of I and the variance
of I. This makes sense, since if I is quite small, small variations in I will result in
large variations in $R = V_0/I$, whereas if I is large, small variations will not affect R
as much. The second-order correction factor for $\mu_R$ also depends on $\mu_I$ and is large if
$\mu_I$ is small. In fact, when I is near zero, the function $g(I) = V_0/I$ is quite nonlinear,
and the linearization is not a good approximation. ■


EXAMPLE B

This example examines the accuracy of the approximations using a simple test case.
We choose the function $g(x) = \sqrt{x}$ and consider two cases: X uniform on [0, 1],
and X uniform on [1, 2]. The graph of g(x) in Figure 4.9 shows that g is more nearly
linear in the latter case, so we would expect the approximations to work better there.


Let $Y = \sqrt{X}$. Because X is uniform on [0, 1],

$$E(Y) = \int_0^1 \sqrt{x}\,dx = \frac{2}{3}$$

and

$$E(Y^2) = \int_0^1 x\,dx = \frac{1}{2}$$

so $\mathrm{Var}(Y) = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}$ and $\sigma_Y = .236$. These results are exact.

FIGURE 4.9 The function $g(x) = \sqrt{x}$ is more nearly linear over the interval [1, 2]
than over the interval [0, 1].


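Using the approximation method, we first calculate

$$g'(x) = \tfrac{1}{2}\,x^{-1/2}$$
$$g''(x) = -\tfrac{1}{4}\,x^{-3/2}$$

Since X is uniform on [0, 1], $\mu_X = \frac{1}{2}$, and evaluating the derivatives at this value
gives us

$$g'(\mu_X) = \frac{\sqrt{2}}{2}$$
$$g''(\mu_X) = -\frac{\sqrt{2}}{2}$$

We know that $\mathrm{Var}(X) = \frac{1}{12}$ for a random variable uniform on [0, 1], so the
approximations are

$$E(Y) \approx \sqrt{\tfrac{1}{2}} - \frac{1}{2} \cdot \frac{1}{12} \cdot \frac{\sqrt{2}}{2} = .678$$
$$\mathrm{Var}(Y) \approx \frac{1}{12} \times \frac{1}{2} = .042$$
$$\sigma_Y \approx .204$$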


The approximation to the mean is .678, and compared to the actual value of .667, it is 
off by about 1.6%. The approximation to the standard deviation is .204, and compared 
to the actual value of .236, it is off by 13%. 

Now let us consider the case in which X is uniform on [1, 2]. Proceeding as
before, we find that $Y = \sqrt{X}$ has mean 1.219. The variance and standard deviation are
.0142 and .119, respectively. To compare these to the approximations, we note that
$\mu_X = \frac{3}{2}$ and $\mathrm{Var}(X) = \frac{1}{12}$ (the random variable uniform on [1, 2] can be obtained by
adding the constant 1 to a random variable uniform on [0, 1]; compare with Theorem A
in Section 4.2). We find

$$g'(\mu_X) = .408$$
$$g''(\mu_X) = -.136$$

so the approximations are

$$E(Y) \approx \sqrt{\tfrac{3}{2}} - \frac{1}{2} \cdot \frac{.136}{12} = 1.219$$
$$\mathrm{Var}(Y) \approx \frac{.408^2}{12} = .0139$$
$$\sigma_Y \approx .118$$

These values are much closer to the actual values than are the approximations for the
first case. ■
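
The two cases of this example can be reproduced with a few lines of Python. This is a minimal sketch, not a general-purpose routine: the function names are hypothetical, the derivatives of the square root are supplied by hand, and the exact moments use the closed forms $E(\sqrt{X}) = \frac{2}{3}(b^{3/2} - a^{3/2})/(b - a)$ and $E(Y^2) = E(X)$ for X uniform on [a, b].

```python
import math


def delta_method(g, d1, d2, mean_x, var_x):
    """Second-order mean and first-order variance approximations for Y = g(X)."""
    approx_mean = g(mean_x) + 0.5 * var_x * d2(mean_x)
    approx_var = var_x * d1(mean_x) ** 2
    return approx_mean, approx_var


def exact_sqrt_uniform(a, b):
    """Exact mean and variance of sqrt(X) for X uniform on [a, b]."""
    mean = (2.0 / 3.0) * (b**1.5 - a**1.5) / (b - a)
    second_moment = (a + b) / 2.0  # E(Y^2) = E(X)
    return mean, second_moment - mean**2


def d1(x):
    """First derivative of sqrt(x)."""
    return 0.5 * x**-0.5


def d2(x):
    """Second derivative of sqrt(x)."""
    return -0.25 * x**-1.5


for a, b in [(0.0, 1.0), (1.0, 2.0)]:
    mean_x, var_x = (a + b) / 2.0, (b - a) ** 2 / 12.0
    approx = delta_method(math.sqrt, d1, d2, mean_x, var_x)
    exact = exact_sqrt_uniform(a, b)
    print((a, b), "approx:", approx, "exact:", exact)
```

Running the loop shows the approximation on [1, 2] agreeing with the exact moments much more closely than on [0, 1], as the example describes.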
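Suppose that we have $Z = g(X, Y)$, a function of two variables. We can again
carry out Taylor series expansions to approximate the mean and variance of Z. To the
first order, letting $\mu$ denote the point $(\mu_X, \mu_Y)$,

$$Z = g(X, Y) \approx g(\mu) + (X - \mu_X)\frac{\partial g(\mu)}{\partial x} + (Y - \mu_Y)\frac{\partial g(\mu)}{\partial y}$$

The notation $\partial g(\mu)/\partial x$ means that the derivative is evaluated at the point $\mu$. Here Z
is expressed as approximately equal to a linear function of X and Y, and the mean
and variance of this linear function are easily calculated to be

$$E(Z) \approx g(\mu)$$

and

$$\mathrm{Var}(Z) \approx \sigma_X^2\left(\frac{\partial g(\mu)}{\partial x}\right)^2 + \sigma_Y^2\left(\frac{\partial g(\mu)}{\partial y}\right)^2 + 2\sigma_{XY}\left(\frac{\partial g(\mu)}{\partial x}\right)\left(\frac{\partial g(\mu)}{\partial y}\right)$$

(For the latter calculation, see Corollary A in Section 4.3.) As is the case with a single
variable, a second-order expansion can be used to obtain an improved estimate of
E(Z):

$$Z = g(X, Y) \approx g(\mu) + (X - \mu_X)\frac{\partial g(\mu)}{\partial x} + (Y - \mu_Y)\frac{\partial g(\mu)}{\partial y}$$
$$+ \frac{1}{2}(X - \mu_X)^2\frac{\partial^2 g(\mu)}{\partial x^2} + \frac{1}{2}(Y - \mu_Y)^2\frac{\partial^2 g(\mu)}{\partial y^2}$$
$$+ (X - \mu_X)(Y - \mu_Y)\frac{\partial^2 g(\mu)}{\partial x\,\partial y}$$

Taking expectations term by term on the right-hand side yields

$$E(Z) \approx g(\mu) + \frac{1}{2}\sigma_X^2\frac{\partial^2 g(\mu)}{\partial x^2} + \frac{1}{2}\sigma_Y^2\frac{\partial^2 g(\mu)}{\partial y^2} + \sigma_{XY}\frac{\partial^2 g(\mu)}{\partial x\,\partial y}$$

The general case of a function of n variables can be worked out similarly; the basic
concepts are illustrated by the two-variable case.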





EXAMPLE C Expectation and Variance of a Ratio
Let us consider the case where Z = Y/X, which arises frequently in practice. For 
example, a chemist might measure the concentrations of two substances, both with 
some measurement error that is indicated by their standard deviations, and then report 
the relative concentrations in the form of a ratio. What is the approximate standard 
deviation of the ratio, Z? 

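Using the method of propagation of error derived above, for $g(x, y) = y/x$, we
have

$$\frac{\partial g}{\partial x} = -\frac{y}{x^2} \qquad \frac{\partial g}{\partial y} = \frac{1}{x}$$
$$\frac{\partial^2 g}{\partial x^2} = \frac{2y}{x^3} \qquad \frac{\partial^2 g}{\partial y^2} = 0 \qquad \frac{\partial^2 g}{\partial x\,\partial y} = -\frac{1}{x^2}$$

Evaluating these derivatives at $(\mu_X, \mu_Y)$ and using the preceding result, we find, if
$\mu_X \ne 0$,

$$E(Z) \approx \frac{\mu_Y}{\mu_X} + \sigma_X^2\frac{\mu_Y}{\mu_X^3} - \frac{\sigma_{XY}}{\mu_X^2}$$
$$= \frac{\mu_Y}{\mu_X} + \frac{1}{\mu_X^2}\left(\sigma_X^2\frac{\mu_Y}{\mu_X} - \rho\sigma_X\sigma_Y\right)$$

From this equation, we see that the difference between E(Z) and $\mu_Y/\mu_X$ depends on
several factors. If $\sigma_X$ and $\sigma_Y$ are small (that is, if the two concentrations are measured
quite accurately), the difference is small. If $\mu_X$ is small, the difference is relatively
large. Finally, correlation between X and Y affects the difference.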

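We now consider the variance. Again using the preceding result and evaluating
the partial derivatives at $(\mu_X, \mu_Y)$, we find

$$\mathrm{Var}(Z) \approx \sigma_X^2\frac{\mu_Y^2}{\mu_X^4} + \frac{\sigma_Y^2}{\mu_X^2} - 2\sigma_{XY}\frac{\mu_Y}{\mu_X^3}$$
$$= \frac{1}{\mu_X^2}\left(\sigma_X^2\frac{\mu_Y^2}{\mu_X^2} + \sigma_Y^2 - 2\rho\sigma_X\sigma_Y\frac{\mu_Y}{\mu_X}\right)$$

From this equation, we see that the ratio is quite variable when $\mu_X$ is small, paralleling
the results in Example A, and that correlation between X and Y, if of the same sign
as $\mu_Y/\mu_X$, decreases Var(Z). ■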
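
The ratio approximations can be compared with exact values in a case where the exact moments are available in closed form. The sketch below assumes, purely for illustration, that X and Y are independent and uniform on [9, 11] and [4, 6], so that $E(1/X)$ and $E(1/X^2)$ can be integrated directly; the function names and the intervals are hypothetical choices, not part of the text.

```python
import math


def ratio_approximations(mu_x, mu_y, sd_x, sd_y, rho):
    """Propagation-of-error approximations to E(Y/X) and Var(Y/X)."""
    cov = rho * sd_x * sd_y
    mean = mu_y / mu_x + sd_x**2 * mu_y / mu_x**3 - cov / mu_x**2
    var = (sd_x**2 * mu_y**2 / mu_x**2 + sd_y**2 - 2.0 * cov * mu_y / mu_x) / mu_x**2
    return mean, var


def exact_independent_uniform_ratio(ax, bx, ay, by):
    """Exact mean and variance of Y/X for independent uniforms, with 0 < ax < bx."""
    e_inv = math.log(bx / ax) / (bx - ax)        # E(1/X)
    e_inv2 = (1.0 / ax - 1.0 / bx) / (bx - ax)   # E(1/X^2)
    e_y = (ay + by) / 2.0
    e_y2 = (by - ay) ** 2 / 12.0 + e_y**2        # E(Y^2)
    mean = e_y * e_inv
    return mean, e_y2 * e_inv2 - mean**2


sd = 2.0 / math.sqrt(12.0)  # standard deviation of a uniform on an interval of length 2
approx_mean, approx_var = ratio_approximations(10.0, 5.0, sd, sd, 0.0)
exact_mean, exact_var = exact_independent_uniform_ratio(9.0, 11.0, 4.0, 6.0)
print(approx_mean, exact_mean)
print(approx_var, exact_var)
```

Because $\mu_X$ is large relative to $\sigma_X$ here, the approximations should be close to the exact values; moving the interval for X toward zero would make them deteriorate.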


4.7 Problems 


1. Show that if a random variable is bounded, that is, $|X| < M < \infty$, then
E(X) exists.


2. If X is a discrete uniform random variable, that is, $P(X = k) = 1/n$ for
$k = 1, 2, \ldots, n$, find E(X) and Var(X).


3. Find E(X) and Var(X) for Problem 3 in Chapter 2.


4. Let X have the cdf $F(x) = 1 - x^{-\alpha}$, $x \ge 1$.


a. Find E(X) for those values of $\alpha$ for which E(X) exists.
b. Find Var(X) for those values of $\alpha$ for which it exists.


5. Let X have the density 





Find E(X) and Var(X). 


6. Let X be a continuous random variable with probability density function
$f(x) = 2x$, $0 \le x \le 1$.

a. Find E(X).

b. Let $Y = X^2$. Find the probability density function of Y and use it to find E(Y).

c. Use Theorem A in Section 4.1.1 to find $E(X^2)$ and compare to your answer
in part (b).

d. Find Var(X) according to the definition of variance given in Section 4.2.
Also find Var(X) by using Theorem B of Section 4.2.


7. Let X be a discrete random variable that takes on values 0, 1, 2 with probabilities
$\frac{1}{2}$, $\frac{3}{8}$, $\frac{1}{8}$, respectively.

a. Find E(X).

b. Let $Y = X^2$. Find the probability mass function of Y and use it to find E(Y).

c. Use Theorem A of Section 4.1.1 to find $E(X^2)$ and compare to your answer
in part (b).

d. Find Var(X) according to the definition of variance given in Section 4.2.
Also find Var(X) by using Theorem B in Section 4.2.


8. Show that if X is a discrete random variable, taking values on the positive
integers, then $E(X) = \sum_{k=1}^{\infty} P(X \ge k)$. Apply this result to find the expected
value of a geometric random variable.


9. This is a simplified inventory problem. Suppose that it costs c dollars to stock an
item and that the item sells for s dollars. Suppose that the number of items that
will be asked for by customers is a random variable with the frequency function
p(k). Find a rule for the number of items that should be stocked in order to
maximize the expected income. (Hint: Consider the difference of successive
terms.)


10. A list of n items is arranged in random order; to find a requested item, they
are searched sequentially until the desired item is found. What is the expected
number of items that must be searched through, assuming that each item is
equally likely to be the one requested? (Questions of this nature arise in the
design of computer algorithms.)


11. Referring to Problem 10, suppose that the items are not equally likely to be
requested but have known probabilities $p_1, p_2, \ldots, p_n$. Suggest an alternative
searching procedure that will decrease the average number of items that must
be searched through, and show that in fact it does so.


12. If X is a continuous random variable with a density that is symmetric about
some point, $\xi$, show that $E(X) = \xi$, provided that E(X) exists.


13. If X is a nonnegative continuous random variable, show that

$$E(X) = \int_0^{\infty} [1 - F(x)]\,dx$$

Apply this result to find the mean of the exponential distribution.
14. Let X be a continuous random variable with the density function
$f(x) = 2x$, $0 \le x \le 1$.

a. Find E(X).

b. Find $E(X^2)$ and Var(X).


15. Suppose that two lotteries each have n possible numbers and the same payoff.
In terms of expected gain, is it better to buy two tickets from one of the lotteries
or one from each?


16. Suppose that $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$. Let $Z = (X - \mu)/\sigma$. Show that
E(Z) = 0 and Var(Z) = 1.


17. Find (a) the expectation and (b) the variance of the kth-order statistic of a sample
of n independent random variables uniform on [0, 1]. The density function is
given in Example C in Section 3.7.


18. If $U_1, \ldots, U_n$ are independent uniform random variables, find $E(U_{(n)} - U_{(1)})$.


19. Find $E(U_{(k)} - U_{(k-1)})$, where the $U_{(k)}$ are as in Problem 18.


20. A stick of unit length is broken into two pieces. Find the expected ratio of the
length of the longer piece to the length of the shorter piece.


21. A random square has a side length that is a uniform [0, 1] random variable.
Find the expected area of the square.


22. A random rectangle has sides the lengths of which are independent uniform
random variables. Find the expected area of the rectangle, and compare this
result to that of Problem 21.


23. Repeat Problems 21 and 22 assuming that the distribution of the lengths is
exponential.


24. Prove Theorem A of Section 4.1.2 for the discrete case.


25. If $X_1$ and $X_2$ are independent random variables following a gamma distribution
with parameters $\alpha$ and $\lambda$, find $E(R^2)$, where $R^2 = X_1^2 + X_2^2$.


26. Referring to Example B in Section 4.1.2, what is the expected number of
coupons needed to collect r different types, where r < n?


27. If n men throw their hats into a pile and each man takes a hat at random, what
is the expected number of matches? (Hint: Express the number as a sum.)


28. Suppose that n enemy aircraft are shot at simultaneously by m gunners, that
each gunner selects an aircraft to shoot at independently of the other gunners,
and that each gunner hits the selected aircraft with probability p. Find the
expected number of aircraft hit by the gunners.


29. Prove Corollary A of Section 4.1.1.


30. Find $E[1/(X + 1)]$, where X is a Poisson random variable.


31. Let X be uniformly distributed on the interval [1, 2]. Find $E(1/X)$. Is
$E(1/X) = 1/E(X)$?


32. Let X have a gamma distribution with parameters $\alpha$ and $\lambda$. For those values of
$\alpha$ and $\lambda$ for which it is defined, find $E(1/X)$.


33. Prove Chebyshev's inequality in the discrete case.


34. Let X be uniform on [0, 1], and let $Y = \sqrt{X}$. Find E(Y) by (a) finding the
density of Y and then finding the expectation and (b) using Theorem A of
Section 4.1.1.


35. Find the mean of a negative binomial random variable. (Hint: Express the
random variable as a sum.)


36. Consider the following scheme for group testing. The original lot of samples is
divided into two groups, and each of the subgroups is tested as a whole. If either
subgroup tests positive, it is divided in two, and the procedure is repeated. If
any of the groups thus obtained tests positive, test every member of that group.
Find the expected number of tests performed, and compare it to the number
performed with no grouping and with the scheme described in Example C in
Section 4.1.2.


37. For what values of p is the group testing of Example C in Section 4.1.2 inferior
to testing every individual?


38. This problem continues Example A of Section 4.1.2.

a. What is the probability that a fragment is the leftmost member of a
contig?

b. What is the expected number of fragments that are leftmost members of
contigs?

c. What is the expected number of contigs?


39. Suppose that a segment of DNA of length 1,000,000 is to be shotgun sequenced
with fragments of length 1000.

a. How many fragments would be needed so that the chance of an individual
site being covered is greater than 0.99?

b. With this choice, how many sites would you expect to be missed?


40. A child types the letters Q, W, E, R, T, Y, randomly producing 1000 letters in
all. What is the expected number of times that the sequence QQQQ appears,
counting overlaps?


41. Continuing with the previous problem, how many times would we expect the
word “TRY” to appear? Would we be surprised if it occurred 100 times? (Hint:
Consider Markov's inequality.)


42. Let X be an exponential random variable with standard deviation $\sigma$. Find
$P(|X - E(X)| > k\sigma)$ for k = 2, 3, 4, and compare the results to the bounds
from Chebyshev's inequality.


43. Show that $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$.
44. If X and Y are independent random variables with equal variances, find
$\mathrm{Cov}(X + Y, X - Y)$.


45. Find the covariance and the correlation of $N_i$ and $N_j$, where $N_1, N_2, \ldots, N_r$
are multinomial random variables. (Hint: Express them as sums.)


46. If $U = a + bX$ and $V = c + dY$, show that $|\rho_{UV}| = |\rho_{XY}|$.


47. If X and Y are independent random variables and $Z = Y - X$, find expressions
for the covariance and the correlation of X and Z in terms of the variances of
X and Y.


48. Let U and V be independent random variables with means $\mu$ and variances $\sigma^2$.
Let $Z = \alpha U + V\sqrt{1 - \alpha^2}$. Find E(Z) and $\rho_{UZ}$.


49. Two independent measurements, X and Y, are taken of a quantity $\mu$.
$E(X) = E(Y) = \mu$, but $\sigma_X$ and $\sigma_Y$ are unequal. The two measurements are combined
by means of a weighted average to give

$$Z = \alpha X + (1 - \alpha)Y$$

where $\alpha$ is a scalar and $0 \le \alpha \le 1$.

a. Show that $E(Z) = \mu$.

b. Find $\alpha$ in terms of $\sigma_X$ and $\sigma_Y$ to minimize Var(Z).

c. Under what circumstances is it better to use the average (X + Y)/2 than
either X or Y alone?


50. Suppose that $X_i$, where $i = 1, \ldots, n$, are independent random variables with
$E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. Let $\bar{X} = n^{-1}\sum_{i=1}^{n} X_i$. Show that $E(\bar{X}) = \mu$
and $\mathrm{Var}(\bar{X}) = \sigma^2/n$.


51. Continuing Example E in Section 4.3, suppose there are n securities, each with
the same expected return, that all the returns have the same standard deviations,
and that the returns are uncorrelated. What is the optimal portfolio vector? Plot
the risk of the optimal portfolio versus n. How does this risk compare to that
incurred by putting all your money in one security?


52. Consider two securities, the first having $\mu_1 = 1$ and $\sigma_1 = 0.1$, and the second
having $\mu_2 = 0.8$ and $\sigma_2 = 0.12$. Suppose that they are negatively correlated,
with $\rho = -0.8$.

a. If you could only invest in one security, which one would you choose, and
why?

b. Suppose you invest 50% of your money in each of the two. What is your
expected return and what is your risk?

c. If you invest 80% of your money in security 1 and 20% in security 2, what
is your expected return and your risk?

d. Denote the expected return and its standard deviation as functions of $\pi$ by
$\mu(\pi)$ and $\sigma(\pi)$. The pair $(\mu(\pi), \sigma(\pi))$ traces out a curve in the plane as $\pi$
varies from 0 to 1. Plot this curve.

e. Repeat b-d if the correlation is $\rho = 0.1$.


53. Show that $\mathrm{Cov}(X, Y) \le \sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$.


54. Let $X$, $Y$, and $Z$ be uncorrelated random variables with variances $\sigma_X^2$, $\sigma_Y^2$, and
$\sigma_Z^2$, respectively. Let

$$U = Z + X$$
$$V = Z + Y$$

Find $\mathrm{Cov}(U, V)$ and $\rho_{UV}$.


55. Let $T = \sum_{k=1}^{n} k X_k$, where the $X_k$ are independent random variables with
means $\mu$ and variances $\sigma^2$. Find $E(T)$ and $\mathrm{Var}(T)$.


56. Let $S = \sum_{k=1}^{n} X_k$, where the $X_k$ are as in Problem 55. Find the covariance and
the correlation of $S$ and $T$.


57. If $X$ and $Y$ are independent random variables, find $\mathrm{Var}(XY)$ in terms of the
means and variances of $X$ and $Y$.


58. A function is measured at two points with some error (for example, the position
of an object is measured at two times). Let

$$X_1 = f(x) + \varepsilon_1$$
$$X_2 = f(x + h) + \varepsilon_2$$

where $\varepsilon_1$ and $\varepsilon_2$ are independent random variables with mean zero and variance
$\sigma^2$. Since the derivative of $f$ is

$$\lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

it is estimated by

$$Z = \frac{X_2 - X_1}{h}$$

a. Find $E(Z)$ and $\mathrm{Var}(Z)$. What is the effect of choosing a value of $h$ that is
very small, as is suggested by the definition of the derivative?

b. Find an approximation to the mean squared error of $Z$ as an estimate of $f'(x)$
using a Taylor series expansion. Can you find the value of $h$ that minimizes
the mean squared error?

c. Suppose that f is measured at three points with some error. How could you 
construct an estimate of the second derivative of f, and what are the mean 
and the variance of your estimate? 


59. Let $(X, Y)$ be a random point uniformly distributed on a unit disk. Show that
$\mathrm{Cov}(X, Y) = 0$, but that $X$ and $Y$ are not independent.


60. Let $Y$ have a density that is symmetric about zero, and let $X = SY$, where $S$ is an
independent random variable taking on the values $+1$ and $-1$ with probability
$\frac{1}{2}$ each. Show that $\mathrm{Cov}(X, Y) = 0$, but that $X$ and $Y$ are not independent.


61. In Section 3.7, the joint density of the minimum and maximum of $n$ independent
uniform random variables was found. In the case $n = 2$, this amounts to $X$
and $Y$, the minimum and maximum, respectively, of two independent random
variables uniform on [0, 1], having the joint density

$$f(x, y) = 2, \quad 0 \le x \le y \le 1$$


a. Find the covariance and the correlation of X and Y. Does the sign of the 
correlation make sense intuitively? 

b. Find E(X|Y = y)and E(Y|X = x). Do these results make sense intuitively? 

c. Find the probability density functions of the random variables E(X|Y) and 
E(Y|X). 

d. What is the linear predictor of $Y$ in terms of $X$ (denoted by $\hat{Y} = a + bX$)
that has minimal mean squared error? What is the mean square prediction
error?

e. What is the predictor of $Y$ in terms of $X$ [$\hat{Y} = h(X)$] that has minimal mean
squared error? What is the mean square prediction error?


62. Let $X$ and $Y$ have the joint distribution given in Problem 1 of Chapter 3.


a. Find the covariance and correlation of X and Y. 
b. Find E(Y|X = x) for x = 1, 2, 3, 4. Find the probability mass function of 
the random variable E(Y|X). 


63. Let $X$ and $Y$ have the joint distribution given in Problem 8 of Chapter 3.


a. Find the covariance and correlation of $X$ and $Y$.
b. Find $E(Y|X = x)$ for $0 \le x \le 1$.


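64. Let $X$ and $Y$ be jointly distributed random variables with correlation $\rho_{XY}$; define
the standardized random variables $\tilde{X}$ and $\tilde{Y}$ as $\tilde{X} = (X - E(X))/\sqrt{\mathrm{Var}(X)}$
and $\tilde{Y} = (Y - E(Y))/\sqrt{\mathrm{Var}(Y)}$. Show that $\mathrm{Cov}(\tilde{X}, \tilde{Y}) = \rho_{XY}$.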


65. How has the assumption that $N$ and the $X_i$ are independent been used in
Example D of Section 4.4.1?


66. A building contains two elevators, one fast and one slow. The average waiting
time for the slow elevator is 3 min. and the average waiting time of the fast 
elevator is 1 min. If a passenger chooses the fast elevator with probability i and 
the slow elevator with probability i what is the expected waiting time? (Use 
the law of total expectation, Theorem A of Section 4.4.1, defining appropriate 
random variables X and Y.) 


67. A random rectangle is formed in the following way: The base, $X$, is chosen
to be a uniform [0, 1] random variable and after having generated the base, 
the height is chosen to be uniform on [0, X]. Use the law of total expectation, 
Theorem A of Section 4.4.1, to find the expected circumference and area of the 
rectangle. 


68. Show that $E[\mathrm{Var}(Y|X)] \le \mathrm{Var}(Y)$.


69. Suppose that a bivariate normal distribution has $\mu_X = \mu_Y = 0$ and $\sigma_X =
\sigma_Y = 1$. Sketch the contours of the density and the lines $E(Y|X = x)$ and
$E(X|Y = y)$ for $\rho = 0$, .5, and .9.


70. If $X$ and $Y$ are independent, show that $E(X|Y = y) = E(X)$.


71. Let $X$ be a binomial random variable representing the number of successes in
n independent Bernoulli trials. Let Y be the number of successes in the first m 
trials, where m < n. Find the conditional frequency function of Y given X = x 
and the conditional mean. 


72. An item is present in a list of $n$ items with probability $p$; if it is present, its
position in the list is uniformly distributed. A computer program searches through
the list sequentially. Find the expected number of items searched through before
the program terminates.


73. A fair coin is tossed $n$ times, and the number of heads, $N$, is counted. The coin
is then tossed N more times. Find the expected total number of heads generated 
by this process. 


74. The number of offspring of an organism is a discrete random variable with
mean $\mu$ and variance $\sigma^2$. Each of its offspring reproduces in the same
manner. Find the expected number of offspring in the third generation and its
variance.


75. Let $T$ be an exponential random variable, and conditional on $T$, let $U$ be uniform
on [0, T]. Find the unconditional mean and variance of U. 


76. Let the point $(X, Y)$ be uniformly distributed over the half disk $x^2 + y^2 \le 1$,
where $y \ge 0$. If you observe $X$, what is the best prediction for $Y$? If you observe
$Y$, what is the best prediction for $X$? For both questions, "best" means having
the minimum mean squared error.


77. Let $X$ and $Y$ have the joint density

$$f(x, y) = e^{-y}, \quad 0 \le x \le y$$


a. Find Cov(X, Y) and the correlation of X and Y. 
b. Find E(X|Y = y) and E(Y|X = x). 
c. Find the density functions of the random variables E(X|Y) and E(Y|X). 


78. Show that if a density is symmetric about zero, its skewness is zero.


79. Let $X$ be a discrete random variable that takes on values 0, 1, 2 with probabilities
p e $, respectively. Find the moment-generating function of X, M(t), and 
verify that $E(X) = M'(0)$ and that $E(X^2) = M''(0)$.

80. Let $X$ be a continuous random variable with density function $f(x) = 2x$,
$0 \le x \le 1$. Find the moment-generating function of $X$, $M(t)$, and verify that
$E(X) = M'(0)$ and that $E(X^2) = M''(0)$.


81. Find the moment-generating function of a Bernoulli random variable, and use
it to find the mean, variance, and third moment. 


82. Use the result of Problem 81 to find the mgf of a binomial random variable and
its mean and variance. 
83. Show that if $X_i$ follows a binomial distribution with $n_i$ trials and probability of
success $p_i = p$, where $i = 1, \ldots, n$ and the $X_i$ are independent, then $\sum_{i=1}^{n} X_i$
follows a binomial distribution.


84. Referring to Problem 83, show that if the $p_i$ are unequal, the sum does not
follow a binomial distribution.


85. Find the mgf of a geometric random variable, and use it to find the mean and
the variance. 


86. Use the result of Problem 85 to find the mgf of a negative binomial random
variable and its mean and variance. 


87. Under what conditions is the sum of independent negative binomial random
variables also negative binomial? 


88. Let $X$ and $Y$ be independent random variables, and let $\alpha$ and $\beta$ be scalars. Find
an expression for the mgf of $Z = \alpha X + \beta Y$ in terms of the mgf's of $X$ and $Y$.


89. Let $X_1, X_2, \ldots, X_n$ be independent normal random variables with means
$\mu_i$ and variances $\sigma_i^2$. Show that $Y = \sum_{i=1}^{n} \alpha_i X_i$, where the $\alpha_i$ are scalars,
is normally distributed, and find its mean and variance. (Hint: Use
moment-generating functions.)


90. Assuming that $X \sim N(0, \sigma^2)$, use the mgf to show that the odd moments are
zero and the even moments are

$$\mu_{2n} = \frac{(2n)!\,\sigma^{2n}}{2^n\,n!}$$


91. Use the mgf to show that if $X$ follows an exponential distribution, $cX$ ($c > 0$)
does also. 


92. Suppose that $\Theta$ is a random variable that follows a gamma distribution with
parameters $\lambda$ and $\alpha$, where $\alpha$ is an integer, and suppose that, conditional on
$\Theta$, $X$ follows a Poisson distribution with parameter $\Theta$. Find the
unconditional distribution of $\alpha + X$. (Hint: Find the mgf by using iterated conditional
expectations.)


93. Find the distribution of a geometric sum of exponential random variables by
using moment-generating functions. 


94. If $X$ is a nonnegative integer-valued random variable, the
probability-generating function of $X$ is defined to be

$$G(s) = \sum_{k=0}^{\infty} s^k p_k$$

where $p_k = P(X = k)$.
a. Show that 
b. Show that

$$\left.\frac{dG}{ds}\right|_{s=1} = E(X)$$

$$\left.\frac{d^2G}{ds^2}\right|_{s=1} = E[X(X - 1)]$$





c. Express the probability-generating function in terms of moment-generating 
function. 
d. Find the probability-generating function of the Poisson distribution. 


95. Show that if $X$ and $Y$ are independent, their joint moment-generating function
factors. 


96. Show how to find $E(XY)$ from the joint moment-generating function of $X$
and Y. 


97. Use moment-generating functions to show that if $X$ and $Y$ are independent,
then

$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$


98. Find the mean and variance of the compound Poisson distribution (Example H
in Section 4.5). 


99. Find expressions for the approximate mean and variance of $Y = g(X)$ for (a)
$g(x) = \sqrt{x}$, (b) $g(x) = \log x$, and (c) $g(x) = \sin^{-1} x$.


100. If $X$ is uniform on [10, 20], find the approximate and exact mean and variance
of Y = 1/X, and compare them. 


101. Find the approximate mean and variance of $Y = \sqrt{X}$, where $X$ is a random
variable following a Poisson distribution.


102. Two sides, $x_0$ and $y_0$, of a right triangle are independently measured as $X$ and
$Y$, where $E(X) = x_0$ and $E(Y) = y_0$ and $\mathrm{Var}(X) = \mathrm{Var}(Y) = \sigma^2$. The angle
between the two sides is then determined as

$$\Theta = \tan^{-1}\left(\frac{Y}{X}\right)$$

Find the approximate mean and variance of $\Theta$.


103. The volume of a bubble is estimated by measuring its diameter and using the
relationship

$$V = \frac{\pi}{6} D^3$$

Suppose that the true diameter is 2 mm and that the standard deviation of the
measurement of the diameter is .01 mm. What is the approximate standard
deviation of the estimated volume?


104. The position of an aircraft relative to an observer on the ground is estimated
by measuring its distance $r$ from the observer and the angle $\theta$ that the line of
sight from the observer to the aircraft makes with the horizontal. Suppose that
the measurements, denoted by $R$ and $\Theta$, are subject to random errors and are
independent of each other. The altitude of the aircraft is then estimated to be
$Y = R \sin \Theta$.

a. Find an approximate expression for the variance of $Y$.

b. For given $r$, at what value of $\theta$ is the estimated altitude most variable?


CHAPTER 5


Limit Theorems


5.1 Introduction


This chapter is principally concerned with the limiting behavior of the sum of
independent random variables as the number of summands becomes large. The results
presented here are both intrinsically interesting and useful in statistics, since many
commonly computed statistical quantities, such as averages, can be represented as
sums.


5.2 The Law of Large Numbers


It is commonly believed that if a fair coin is tossed many times and the proportion of
heads is calculated, that proportion will be close to $\frac{1}{2}$. John Kerrich, a South African
mathematician, tested this belief empirically while detained as a prisoner during
World War II. He tossed a coin 10,000 times and observed 5067 heads. The law of
large numbers is a mathematical formulation of this belief. The successive tosses of
the coin are modeled as independent random trials. The random variable $X_i$ takes on
the value 0 or 1 according to whether the $i$th trial results in a tail or a head, and the
proportion of heads in $n$ trials is

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

The law of large numbers states that $\bar{X}_n$ approaches $\frac{1}{2}$ in a sense that is specified by
the following theorem.
THEOREM A Law of Large Numbers


Let $X_1, X_2, \ldots, X_i, \ldots$ be a sequence of independent random variables with
$E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. Let $\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i$. Then, for any $\varepsilon > 0$,

$$P(|\bar{X}_n - \mu| > \varepsilon) \to 0 \quad \text{as } n \to \infty$$


Proof 
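We first find $E(\bar{X}_n)$ and $\mathrm{Var}(\bar{X}_n)$:

$$E(\bar{X}_n) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \mu$$

Since the $X_i$ are independent,

$$\mathrm{Var}(\bar{X}_n) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{\sigma^2}{n}$$

The desired result now follows immediately from Chebyshev's inequality, which
states that

$$P(|\bar{X}_n - \mu| > \varepsilon) \le \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n \varepsilon^2} \to 0, \quad \text{as } n \to \infty$$

■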





In the case of a fair coin toss, the $X_i$ are Bernoulli random variables with $p = 1/2$,
$E(X_i) = 1/2$, and $\mathrm{Var}(X_i) = 1/4$. If the coin is tossed 10,000 times,

$$\mathrm{Var}(\bar{X}_{10,000}) = 2.5 \times 10^{-5}$$

and the standard deviation of the average is the square root of the variance, 0.005.
The proportion observed by Kerrich, 0.5067, is thus a little more than one standard
deviation away from its expected value of 0.5, consistent with Chebyshev's inequality.
(Recall from Section 4.2 that Chebyshev's inequality can be written in the form
$P(|\bar{X}_n - \mu| > k\sigma) \le 1/k^2$, where $\sigma$ here denotes the standard deviation of $\bar{X}_n$.)
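
The arithmetic in this paragraph can be checked numerically. The following Python sketch is an illustration only: the function names and the seed are arbitrary choices, not part of the text. It computes the standard deviation of the proportion of heads, expresses Kerrich's observed proportion in standard-deviation units, evaluates the corresponding Chebyshev bound, and simulates one run of 10,000 tosses.

```python
import math
import random

def sd_of_proportion(n, p=0.5):
    """Standard deviation of the proportion of heads in n independent tosses."""
    return math.sqrt(p * (1 - p) / n)

def simulate_proportion(n, p=0.5, seed=0):
    """Proportion of heads in n simulated tosses (the seed is arbitrary)."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n) if rng.random() < p)
    return heads / n

n = 10_000
sd = sd_of_proportion(n)               # square root of 2.5e-5
k = (0.5067 - 0.5) / sd                # Kerrich's deviation in standard deviations
chebyshev_bound = min(1.0, 1 / k**2)   # bound on P(|Xbar - 1/2| > k * sd)

print(f"sd = {sd:.4f}, k = {k:.2f}, Chebyshev bound = {chebyshev_bound:.3f}")
print("simulated proportion:", simulate_proportion(n))
```

A deviation of about 1.34 standard deviations gives a Chebyshev bound above one half, so the bound by itself says little here; it only confirms that Kerrich's result is unremarkable.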

If a sequence of random variables, $\{Z_n\}$, is such that $P(|Z_n - \alpha| > \varepsilon)$ approaches
zero as $n$ approaches infinity, for any $\varepsilon > 0$ and where $\alpha$ is some scalar, then $Z_n$ is
said to converge in probability to $\alpha$. There is another mode of convergence, called
strong convergence or almost sure convergence, which asserts more. $Z_n$ is said to
converge almost surely to $\alpha$ if for every $\varepsilon > 0$, $|Z_n - \alpha| > \varepsilon$ only a finite number
of times with probability 1; that is, beyond some point in the sequence, the difference
is always less than $\varepsilon$, but where that point is random. The version of the law of large
numbers stated and proved earlier asserts that $\bar{X}_n$ converges to $\mu$ in probability. This
version is usually called the weak law of large numbers. Under the same assumptions,
a strong law of large numbers, which asserts that $\bar{X}_n$ converges almost surely to $\mu$,
can also be proved, but we will not do so.

We now consider some examples that illustrate the utility of the law of large 
numbers. 


EXAMPLE A

Monte Carlo Integration
Suppose that we wish to calculate

$$I(f) = \int_0^1 f(x)\,dx$$

where the integration cannot be done by elementary means or evaluated using
tables of integrals. The most common approach is to use a numerical method in which
the integral is approximated by a sum; various schemes and computer packages
exist for doing this. Another method, called the Monte Carlo method, works in the
following way. Generate independent uniform random variables on [0, 1], that is,
$X_1, X_2, \ldots, X_n$, and compute

$$\hat{I}(f) = \frac{1}{n} \sum_{i=1}^{n} f(X_i)$$

By the law of large numbers, this should be close to $E[f(X)]$, which is simply

$$E[f(X)] = \int_0^1 f(x)\,dx = I(f)$$

This simple scheme can be easily modified in order to change the range of integration
and in other ways. Compared to the standard numerical methods, it is not especially
efficient in one dimension, but becomes increasingly efficient as the dimensionality
of the integral grows.

As a concrete example, let us consider the evaluation of

$$I(f) = \frac{1}{\sqrt{2\pi}} \int_0^1 e^{-x^2/2}\,dx$$

The integral is that of the standard normal density, which cannot be evaluated in closed
form. From the table of the normal distribution (Table 2 in Appendix B), an accurate
numerical approximation is $I(f) = .3413$. If 1000 points, $X_1, \ldots, X_{1000}$, uniformly
distributed over the interval $0 \le x \le 1$, are generated using a pseudorandom number
generator, the integral is then approximated by

$$\hat{I}(f) = \frac{1}{1000} \left(\frac{1}{\sqrt{2\pi}}\right) \sum_{i=1}^{1000} e^{-X_i^2/2}$$

which produced for one realization of the $X_i$ the value .3417. ■
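
The Monte Carlo calculation just described can be sketched in a few lines of Python. This is a minimal illustration, not the computation used in the text: the function names and the seed are assumptions, and a different pseudorandom generator or seed will give a slightly different value than the .3417 quoted above. The reference value is obtained from the error function, since $\Phi(1) - \Phi(0) = \frac{1}{2}\,\mathrm{erf}(1/\sqrt{2})$.

```python
import math
import random

def f(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def monte_carlo_integral(func, n, seed=0):
    """Average of func over n independent uniform [0, 1] points."""
    rng = random.Random(seed)  # the seed is an arbitrary choice
    return sum(func(rng.random()) for _ in range(n)) / n

estimate = monte_carlo_integral(f, 1000)
reference = 0.5 * math.erf(1 / math.sqrt(2))  # Phi(1) - Phi(0), about .3413

print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"Reference value:      {reference:.4f}")
```

Because $f$ varies only between about .24 and .40 on [0, 1], the standard deviation of an average of 1000 values is roughly .0015, which is why a single realization typically agrees with the table value to two or three decimal places.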


EXAMPLE B

Repeated Measurements

Suppose that repeated independent unbiased measurements, $X_1, \ldots, X_n$, of a quantity
are made. If $n$ is large, the law of large numbers says that $\bar{X}$ will be close to the true
value, $\mu$, of the quantity, but how close $\bar{X}$ is depends not only on $n$ but on the variance
of the measurement error, $\sigma^2$, as can be seen in the proof of Theorem A.
Fortunately, $\sigma^2$ can be estimated and therefore

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$$

can be estimated from the data to assess the precision of $\bar{X}$. First, note that $n^{-1} \sum_{i=1}^{n} X_i^2$
converges to $E(X^2)$, from the law of large numbers. Second, it can be shown that if
$Z_n$ converges to $\alpha$ in probability and $g$ is a continuous function, then

$$g(Z_n) \to g(\alpha)$$

which implies that

$$\bar{X}^2 \to [E(X)]^2$$

Finally, since $n^{-1} \sum_{i=1}^{n} X_i^2$ converges to $E(X^2)$ and $\bar{X}^2$ converges to $[E(X)]^2$, with a
little additional argument it can be shown that

$$\frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2 \to E(X^2) - [E(X)]^2 = \mathrm{Var}(X)$$

More generally, it follows from the law of large numbers that the sample moments,
$n^{-1} \sum_{i=1}^{n} X_i^r$, converge in probability to the moments of $X$, $E(X^r)$.
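
As a hedged illustration of the estimate described above, the following Python sketch computes the sample mean, the moment-based variance estimate $n^{-1} \sum X_i^2 - \bar{X}^2$, and the resulting estimated standard deviation of $\bar{X}$. The simulated measurements (normal errors about a true value of 10 with standard deviation 2) and all names are assumptions made only for the demonstration.

```python
import math
import random

def mean_and_standard_error(xs):
    """Return (sample mean, moment estimate of Var(X), estimated sd of the mean)."""
    n = len(xs)
    xbar = sum(xs) / n
    second_moment = sum(x * x for x in xs) / n
    var_hat = second_moment - xbar ** 2   # converges in probability to Var(X)
    return xbar, var_hat, math.sqrt(var_hat / n)

# Hypothetical data: 5000 measurements of a quantity whose true value is 10,
# with normally distributed errors of standard deviation 2.
rng = random.Random(1)
measurements = [rng.gauss(10.0, 2.0) for _ in range(5000)]

xbar, var_hat, se = mean_and_standard_error(measurements)
print(f"mean = {xbar:.3f}, variance estimate = {var_hat:.3f}, sd of mean = {se:.4f}")
```

In floating-point arithmetic the difference of two nearly equal moments can lose precision when the mean is very large relative to the spread; summing squared deviations from the mean is the usual remedy.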


EXAMPLE C

A muscle or nerve cell membrane contains a very large number of channels; when
open, these channels allow ions to pass through. Individual channels seem to open and
close randomly, and it is often assumed that in an equilibrium situation the channels
open and close independently of each other and that only a very small fraction are open
at any one time. Suppose then that the probability that a channel is open is $p$, a very
small number, that there are $m$ channels in all, and that the amount of current flowing
through an individual channel is $c$. The number of channels open at any particular
time is $N$, a binomial random variable with $m$ trials and probability $p$ of success on
each trial. The total amount of current is $S = cN$ and can be measured. We then
have

$$E(S) = cE(N) = cmp$$
$$\mathrm{Var}(S) = c^2 mp(1 - p)$$

and

$$\frac{\mathrm{Var}(S)}{E(S)} = c(1 - p) \approx c$$

since $p$ is small. Thus, through independent measurements, $S_1, \ldots, S_n$, we can
estimate $E(S)$ and $\mathrm{Var}(S)$ and therefore $c$, the amount of current flowing through a single
channel, without knowing how many channels there are. ■
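
The variance-to-mean argument can be tried on simulated data. The Python sketch below is illustrative only: the values of $m$, $p$, and $c$, the number of measurements, the seed, and the function names are all hypothetical. It generates measurements $S = cN$ with $N$ binomial and recovers $c$ from the ratio of the sample variance to the sample mean, without using $m$ in the estimate.

```python
import random

def simulate_total_current(m, p, c, rng):
    """One measurement S = c * N, where N ~ binomial(m, p) counts open channels."""
    n_open = sum(1 for _ in range(m) if rng.random() < p)
    return c * n_open

def estimate_single_channel_current(samples):
    """Ratio of sample variance to sample mean; approximates c(1 - p), close to c."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return var / mean

rng = random.Random(2)
m, p, c = 1000, 0.01, 2.0   # hypothetical channel count, open probability, current
samples = [simulate_total_current(m, p, c, rng) for _ in range(400)]

c_hat = estimate_single_channel_current(samples)
print(f"estimated single-channel current: {c_hat:.2f} (true c = {c})")
```

With only a few hundred measurements the variance estimate is still fairly noisy, so the estimate of $c$ should be expected to be off by several percent.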
5.3 Convergence in Distribution and the Central Limit Theorem


In applications, we often want to find $P(a < X < b)$ when we do not know the
cdf of $X$ precisely; it is sometimes possible to do this by approximating $F_X$. The
approximation is often arrived at by some sort of limiting argument. The most famous
limit theorem in probability theory is the central limit theorem, which is the main
topic of this section. Before discussing the central limit theorem, we develop some
introductory terminology, theory, and examples.


DEFINITION 


Let $X_1, X_2, \ldots$ be a sequence of random variables with cumulative distribution
functions $F_1, F_2, \ldots$, and let $X$ be a random variable with distribution function
$F$. We say that $X_n$ converges in distribution to $X$ if

$$\lim_{n \to \infty} F_n(x) = F(x)$$

at every point at which $F$ is continuous. ■


Moment-generating functions are often useful for establishing the convergence
of distribution functions. We know from Property A of Section 4.5 that a
distribution function $F_n$ is uniquely determined by its mgf, $M_n$. The following theorem,
which we give without proof, states that this unique determination holds for limits
as well.


THEOREM A Continuity Theorem 


Let $F_n$ be a sequence of cumulative distribution functions with the corresponding
moment-generating functions $M_n$. Let $F$ be a cumulative distribution function with
the moment-generating function $M$. If $M_n(t) \to M(t)$ for all $t$ in an open interval
containing zero, then $F_n(x) \to F(x)$ at all continuity points of $F$. ■


EXAMPLE A

We will show that the Poisson distribution can be approximated by the normal
distribution for large values of $\lambda$. This is suggested by examining Figure 2.6, which shows
that as $\lambda$ increases, the probability mass function of the Poisson distribution becomes
more symmetric and bell-shaped.

Let $\lambda_1, \lambda_2, \ldots$ be an increasing sequence with $\lambda_n \to \infty$, and let $\{X_n\}$ be a
sequence of Poisson random variables with the corresponding parameters. We know
that $E(X_n) = \mathrm{Var}(X_n) = \lambda_n$. If we wish to approximate the Poisson distribution
function by a normal distribution function, the normal must have the same mean and
variance as the Poisson does. In addition, if we wish to prove a limiting result, we run
into the difficulty that the mean and variance are tending to infinity. This difficulty is
dealt with by standardizing the random variables, that is, by letting

$$Z_n = \frac{X_n - E(X_n)}{\sqrt{\mathrm{Var}(X_n)}} = \frac{X_n - \lambda_n}{\sqrt{\lambda_n}}$$

We then have $E(Z_n) = 0$ and $\mathrm{Var}(Z_n) = 1$, and we will show that the mgf of $Z_n$
converges to the mgf of the standard normal distribution.
The mgf of $X_n$ is

$$M_{X_n}(t) = e^{\lambda_n (e^t - 1)}$$

By Property C of Section 4.5, the mgf of $Z_n$ is

$$M_{Z_n}(t) = e^{-t\sqrt{\lambda_n}}\, M_{X_n}\!\left(\frac{t}{\sqrt{\lambda_n}}\right) = e^{-t\sqrt{\lambda_n}}\, e^{\lambda_n (e^{t/\sqrt{\lambda_n}} - 1)}$$

It will be easier to work with the log of this expression.

$$\log M_{Z_n}(t) = -t\sqrt{\lambda_n} + \lambda_n \left(e^{t/\sqrt{\lambda_n}} - 1\right)$$

Using the power series expansion $e^x = \sum_{k=0}^{\infty} x^k / k!$, we have
$\lambda_n (e^{t/\sqrt{\lambda_n}} - 1) = t\sqrt{\lambda_n} + t^2/2 + t^3/(6\sqrt{\lambda_n}) + \cdots$. The first term cancels
$-t\sqrt{\lambda_n}$, and the terms beyond $t^2/2$ tend to zero as $\lambda_n \to \infty$, so we see that

$$\lim_{n \to \infty} \log M_{Z_n}(t) = \frac{t^2}{2}$$

or

$$\lim_{n \to \infty} M_{Z_n}(t) = e^{t^2/2}$$

The last expression is the mgf of the standard normal distribution.

We have shown that a standardized Poisson random variable converges in
distribution to a standard normal variable as $\lambda$ approaches infinity. Practically, we wish to
use this limiting result as a basis for an approximation for large but finite values of $\lambda_n$.
How adequate the approximation is for $\lambda = 100$, say, is a matter for theoretical and/or
empirical investigation. It turns out that the approximation is increasingly good for
large values of $\lambda$ and that $\lambda$ does not have to be all that large. (See Problem 8 at the
end of this chapter.) ■
The next example shows how the approximation of the Poisson distribution can
be applied in a specific case.


EXAMPLE B

A certain type of particle is emitted at a rate of 900 per hour. What is the probability
that more than 950 particles will be emitted in a given hour if the counts form a
Poisson process?

Let $X$ be a Poisson random variable with mean 900. We find $P(X > 950)$ by
standardizing:

$$P(X > 950) = P\left(\frac{X - 900}{\sqrt{900}} > \frac{950 - 900}{\sqrt{900}}\right) \approx 1 - \Phi\left(\frac{5}{3}\right) = .04779$$

where $\Phi$ is the standard normal cdf. For comparison, the exact probability is
.04712. ■
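
The comparison between the normal approximation and the exact Poisson probability can be reproduced with the Python standard library. This sketch is illustrative; the function names are assumptions. The exact tail is obtained by summing the Poisson probabilities on the log scale (using `math.lgamma` for $\log k!$) so that the very large factorials and powers never have to be formed directly.

```python
import math

def normal_cdf(z):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def poisson_upper_tail(lam, k):
    """Exact P(X > k) for X ~ Poisson(lam), summing pmf terms computed on the log scale."""
    cdf = sum(math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
              for j in range(k + 1))
    return 1 - cdf

lam, k = 900, 950
approx = 1 - normal_cdf((k - lam) / math.sqrt(lam))
exact = poisson_upper_tail(lam, k)

print(f"normal approximation: {approx:.5f}")
print(f"exact Poisson tail:   {exact:.5f}")
```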


We now turn to the central limit theorem, which is concerned with a limiting
property of sums of random variables. If $X_1, X_2, \ldots$ is a sequence of independent
random variables with mean $\mu$ and variance $\sigma^2$, and if

$$S_n = \sum_{i=1}^{n} X_i$$

we know from the law of large numbers that $S_n/n$ converges to $\mu$ in probability. This
followed from the fact that

$$\mathrm{Var}\left(\frac{S_n}{n}\right) = \frac{1}{n^2} \mathrm{Var}(S_n) = \frac{\sigma^2}{n} \to 0$$

The central limit theorem is concerned not with the fact that the ratio $S_n/n$ converges
to $\mu$ but with how it fluctuates around $\mu$. To analyze these fluctuations, we
standardize:

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$

You should verify that $Z_n$ has mean 0 and variance 1. The central limit theorem states
that the distribution of $Z_n$ converges to the standard normal distribution.
THEOREM B Central Limit Theorem


Let $X_1, X_2, \ldots$ be a sequence of independent random variables having mean 0
and variance $\sigma^2$ and the common distribution function $F$ and moment-generating
function $M$ defined in a neighborhood of zero. Let

$$S_n = \sum_{i=1}^{n} X_i$$

Then

$$\lim_{n \to \infty} P\left(\frac{S_n}{\sigma\sqrt{n}} \le x\right) = \Phi(x), \quad -\infty < x < \infty$$


Proof 


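Let $Z_n = S_n/(\sigma\sqrt{n})$. We will show that the mgf of $Z_n$ tends to the mgf of the
standard normal distribution. Since $S_n$ is a sum of independent random variables,

$$M_{S_n}(t) = [M(t)]^n$$

and

$$M_{Z_n}(t) = \left[M\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$$

$M(s)$ has a Taylor series expansion about zero:

$$M(s) = M(0) + sM'(0) + \tfrac{1}{2} s^2 M''(0) + \varepsilon_s$$

where $\varepsilon_s/s^2 \to 0$ as $s \to 0$. Since $E(X) = 0$, $M'(0) = 0$, and $M''(0) = \sigma^2$. As
$n \to \infty$, $t/(\sigma\sqrt{n}) \to 0$, and

$$M\left(\frac{t}{\sigma\sqrt{n}}\right) = 1 + \frac{1}{2} \sigma^2 \left(\frac{t}{\sigma\sqrt{n}}\right)^2 + \varepsilon_n$$

where $\varepsilon_n/(t^2/(n\sigma^2)) \to 0$ as $n \to \infty$. We thus have

$$M_{Z_n}(t) = \left(1 + \frac{t^2}{2n} + \varepsilon_n\right)^n$$

It can be shown that if $a_n \to a$, then

$$\lim_{n \to \infty} \left(1 + \frac{a_n}{n}\right)^n = e^a$$

From this result, it follows that

$$M_{Z_n}(t) \to e^{t^2/2} \quad \text{as } n \to \infty$$

where $\exp(t^2/2)$ is the mgf of the standard normal distribution, as was to be
shown. ■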


Theorem B is one of the simplest versions of the central limit theorem; there are
many central limit theorems of various degrees of abstraction and generality. We have
proved Theorem B under the assumption that the moment-generating functions exist,
which is a rather strong assumption. By using characteristic functions instead, we
could modify the proof so that it would only be necessary that first and second
moments exist. Further generalizations weaken the assumption that the $X_i$ have the same
distribution and apply to linear combinations of independent random variables.
Central limit theorems have also been proved that weaken the independence assumption
and allow the $X_i$ to be dependent but not "too" dependent.

For practical purposes, especially for statistics, the limiting result in itself is not of 
primary interest. Statisticians are more interested in its use as an approximation with 
finite values of n. It is impossible to give a concise and definitive statement of how 
good the approximation is, but some general guidelines are available, and examining 
special cases can give insight. How fast the approximation becomes good depends
on the distribution of the summands, the $X_i$. If the distribution is fairly symmetric
and has tails that die off rapidly, the approximation becomes good for relatively small 
values of n. If the distribution is very skewed or if the tails die down very slowly, a 
larger value of n is needed for a good approximation. The following examples deal 
with two special cases. 


EXAMPLE C

Because the uniform distribution on [0, 1] has mean $\frac{1}{2}$ and variance $\frac{1}{12}$, the sum of
12 uniform random variables, minus 6, has mean 0 and variance 1. The distribution
of this sum is quite close to normal; in fact, before better algorithms were developed,
it was commonly used in computers for generating normal random variables from
uniform ones. It is possible to compare the real and approximate distributions
analytically, but we will content ourselves with a simple demonstration. Figure 5.1 shows
a histogram of 1000 such sums with a superimposed normal density function. The fit
is surprisingly good, especially considering that 12 is not usually regarded as a large
value of $n$. ■

FIGURE 5.1 A histogram of 1000 values, each of which is the sum of 12 uniform
$[-\frac{1}{2}, \frac{1}{2}]$ pseudorandom variables, with an approximating standard normal density.
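
The sum-of-12-uniforms construction is easy to try. The Python sketch below is an illustration under stated assumptions (the sample size of 20,000, the seed, and the names are arbitrary): it generates the sums, then compares the sample mean, the sample variance, and the fraction of values within 1.96 of zero with the standard normal values 0, 1, and about .95.

```python
import random

def approx_standard_normal(rng):
    """Sum of 12 independent uniform [0, 1] variables, minus 6."""
    return sum(rng.random() for _ in range(12)) - 6.0

rng = random.Random(3)   # arbitrary seed
draws = [approx_standard_normal(rng) for _ in range(20000)]

n = len(draws)
mean = sum(draws) / n
var = sum((z - mean) ** 2 for z in draws) / n
within = sum(1 for z in draws if abs(z) < 1.96) / n

print(f"mean = {mean:.3f}, variance = {var:.3f}")
print(f"fraction within 1.96 of zero = {within:.3f} (standard normal: about .95)")
```

One visible limitation of the construction is that a sum of 12 uniforms minus 6 can never fall outside $[-6, 6]$, so the extreme tails of the normal distribution are not reproduced.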
EXAMPLE D

The sum of $n$ independent exponential random variables with parameter $\lambda = 1$
follows a gamma distribution with $\lambda = 1$ and $\alpha = n$ (Example F in Section 4.5). The
exponential density is quite skewed; therefore, a good approximation of a standardized
gamma by a standardized normal would not be expected for small $n$. Figure 5.2 shows
the cdf's of the standard normal and standardized gamma distributions for increasing
values of $n$. Note how the approximation improves as $n$ increases. ■


FIGURE 5.2 The standard normal cdf (solid line) and the cdf's of standardized
gamma distributions with $\alpha = 5$ (long dashes), $\alpha = 10$ (short dashes), and $\alpha = 30$ (dots).
The vertical axis shows cumulative probability.


Let us now consider some applications of the central limit theorem. 


EXAMPLE E

Measurement Error

Suppose that $X_1, \ldots, X_n$ are repeated, independent measurements of a quantity, $\mu$,
and that $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. The average of the measurements, $\bar{X}$, is
used as an estimate of $\mu$. The law of large numbers tells us that $\bar{X}$ converges to
$\mu$ in probability, so we can hope that $\bar{X}$ is close to $\mu$ if $n$ is large. Chebyshev's
inequality allows us to bound the probability of an error of a given size, but the
central limit theorem gives a much sharper approximation to the actual error. Suppose
that we wish to find $P(|\bar{X} - \mu| < c)$ for some constant $c$. To use the central limit
theorem to approximate this probability, we first standardize, using $E(\bar{X}) = \mu$ and
$\mathrm{Var}(\bar{X}) = \sigma^2/n$:

$$P(|\bar{X} - \mu| < c) = P(-c < \bar{X} - \mu < c)$$

$$= P\left(\frac{-c}{\sigma/\sqrt{n}} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < \frac{c}{\sigma/\sqrt{n}}\right)$$

$$\approx \Phi\left(\frac{c\sqrt{n}}{\sigma}\right) - \Phi\left(\frac{-c\sqrt{n}}{\sigma}\right)$$
For example, suppose that 16 measurements are taken with $\sigma = 1$. The
probability that the average deviates from $\mu$ by less than .5 is approximately

$$P(|\bar{X} - \mu| < .5) = \Phi(.5 \times 4) - \Phi(-.5 \times 4) = .954$$

This sort of reasoning can be turned around. That is, given $c$ and $\gamma$, $n$ can be found
such that

$$P(|\bar{X} - \mu| < c) \ge \gamma$$

■
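
Both directions of this calculation can be sketched in Python. The functions below are illustrative (their names are not from the text) and rely on the normal approximation throughout: the first evaluates $\Phi(c\sqrt{n}/\sigma) - \Phi(-c\sqrt{n}/\sigma)$, and the second searches for the smallest $n$ at which that approximate probability reaches a target $\gamma$.

```python
import math

def normal_cdf(z):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_within(c, sigma, n):
    """Normal approximation to P(|Xbar - mu| < c) for the mean of n measurements."""
    z = c * math.sqrt(n) / sigma
    return normal_cdf(z) - normal_cdf(-z)

def smallest_n(c, sigma, gamma):
    """Smallest n for which the approximate probability is at least gamma."""
    n = 1
    while prob_within(c, sigma, n) < gamma:
        n += 1
    return n

print(f"n = 16, sigma = 1, c = .5: {prob_within(0.5, 1.0, 16):.3f}")
print("smallest n for gamma = .95:", smallest_n(0.5, 1.0, 0.95))
print("smallest n for gamma = .99:", smallest_n(0.5, 1.0, 0.99))
```

The linear search is adequate here because the probability increases with $n$; for very small $c/\sigma$ one could instead solve $c\sqrt{n}/\sigma \ge z$ directly for the appropriate normal quantile $z$.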


EXAMPLE F

Normal Approximation to the Binomial Distribution
Since a binomial random variable is the sum of independent Bernoulli random
variables, its distribution can be approximated by a normal distribution. The
approximation is best when the binomial distribution is symmetric, that is, when $p = \frac{1}{2}$. A
frequently used rule of thumb is that the approximation is reasonable when $np > 5$
and $n(1 - p) > 5$. The approximation is especially useful for large values of $n$, for
which tables are not readily available.

Suppose that a coin is tossed 100 times and lands heads up 60 times. Should we
be surprised and doubt that the coin is fair?

To answer this question, we note that if the coin is fair, the number of heads, $X$, is
a binomial random variable with $n = 100$ trials and probability of success $p = \frac{1}{2}$, so
that $E(X) = np = 50$ (see Example A of Section 4.1) and $\mathrm{Var}(X) = np(1 - p) = 25$
(see Example B of Section 4.3). We could calculate $P(X = 60)$, which would be a
small number. But because there are so many possible outcomes, $P(X = 50)$ is also
a small number, so this calculation would not really answer the question. Instead, we
calculate the probability of a deviation as extreme as or more extreme than 60 if the
coin is fair; that is, we calculate $P(X \ge 60)$. To approximate this probability from
the normal distribution, we standardize:

$$P(X \ge 60) = P\left(\frac{X - 50}{5} \ge \frac{60 - 50}{5}\right) \approx 1 - \Phi(2) = .0228$$

The probability is rather small, so the fairness of the coin is called into question. ■
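
For a sample size of 100 the exact binomial tail is easy to compute, which makes it possible to see how close the normal approximation is in this case. The following Python sketch is illustrative only; the function names are assumptions.

```python
import math

def normal_cdf(z):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binomial_upper_tail(n, p, k):
    """Exact P(X >= k) for X ~ binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n, p, k = 100, 0.5, 60
mu = n * p
sd = math.sqrt(n * p * (1 - p))

approx = 1 - normal_cdf((k - mu) / sd)   # 1 - Phi(2)
exact = binomial_upper_tail(n, p, k)

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial tail:  {exact:.4f}")
```

The exact probability comes out somewhat larger than the approximation, but both are small enough to support the same conclusion about the coin.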


EXAMPLE G

Particle Size Distribution
The distribution of the sizes of grains of particulate matter is often found to be quite 
skewed, with a slowly decreasing right tail. A distribution called the lognormal is 
sometimes fit to such a distribution, and X is said to follow a lognormal distribution if 
log X has a normal distribution. The central limit theorem gives a theoretical rationale 
for the use of the lognormal distribution in some situations. 

Suppose that a particle of initial size $y_0$ is subjected to repeated impacts, that on
each impact a proportion, $X_i$, of the particle remains, and that the $X_i$ are modeled as
independent random variables having the same distribution. After the first impact, the
size of the particle is $Y_1 = X_1 y_0$; after the second impact, the size is $Y_2 = X_2 X_1 y_0$;
and after the $n$th impact, the size is

$$Y_n = X_n X_{n-1} \cdots X_2 X_1 y_0$$

Then

$$\log Y_n = \log y_0 + \sum_{i=1}^{n} \log X_i$$

and the central limit theorem applies to $\log Y_n$. ■


A similar construction is relevant to the theory of finance. Suppose that an initial
investment of value $v_0$ is made and that returns occur in discrete time, for example,
daily. If the return on the first day is $R_1$, then the value becomes $V_1 = R_1 v_0$. After
day two the value is $V_2 = R_2 R_1 v_0$, and after day $n$ the value is

$$V_n = R_n R_{n-1} \cdots R_1 v_0$$

The log value is thus

$$\log V_n = \log v_0 + \sum_{i=1}^{n} \log R_i$$

If the returns are independent random variables with the same distribution, then
$\log V_n$ is approximately normally distributed.
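
A small simulation illustrates the multiplicative construction. In the Python sketch below, the daily returns are taken to be uniform on [0.9, 1.1]; this distribution, the initial value, the number of days, the number of simulated paths, and the seed are all hypothetical choices made only for the demonstration. The check is that about 95% of the standardized values of $\log V_n$ fall within 1.96 of zero, as they would for a normal distribution.

```python
import math
import random

def log_final_value(v0, n_days, rng):
    """log V_n = log v0 + sum of log R_i, with hypothetical uniform daily returns."""
    log_v = math.log(v0)
    for _ in range(n_days):
        log_v += math.log(rng.uniform(0.9, 1.1))
    return log_v

rng = random.Random(4)   # arbitrary seed
logs = [log_final_value(100.0, 50, rng) for _ in range(5000)]

m = sum(logs) / len(logs)
s = math.sqrt(sum((x - m) ** 2 for x in logs) / len(logs))
coverage = sum(1 for x in logs if abs((x - m) / s) < 1.96) / len(logs)

print(f"mean of log V_n = {m:.3f}, sd = {s:.3f}")
print(f"fraction within 1.96 sd of the mean = {coverage:.3f}")
```

The same code describes the particle-size model if the returns are read as the proportions remaining after each impact, in which case they would be restricted to values below 1.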


5.4 Problems 


1. Let $X_1, X_2, \ldots$ be a sequence of independent random variables with $E(X_i) = \mu$
and $\mathrm{Var}(X_i) = \sigma_i^2$. Show that if $n^{-2} \sum_{i=1}^{n} \sigma_i^2 \to 0$, then $\bar{X} \to \mu$ in probability.


2. Let X; be as in Problem 1 but with E(X;) = u; and n ! $5; , ui — u. Show 
that X — n in probability. 


3. Suppose that the number of insurance claims, N, filed in a year is Poisson 
distributed with E(N) = 10,000. Use the normal approximation to the Poisson 
to approximate P(N > 10,200). 


4. Suppose that the number of traffic accidents, N, in a given period of time is dis- 
tributed as a Poisson random variable with E (N) = 100. Use the normal approx- 
imation to the Poisson to find A such that P(100 — A < N < 100+ A) « .9. 


5. Using moment-generating functions, show that as n — oo, p — 0, and 
np — i, the binomial distribution with parameters n and p tends to the Poisson 
distribution. 


6. Using moment-generating functions, show that as œ — oo the gamma distri- 
bution with parameters o and A, properly standardized, tends to the standard 
normal distribution. 


7. Show that if X, — c in probability and if g is a continuous function, then 
g(X,) — g(c) in probability. 


8. Compare the Poisson cdf and the normal approximation for (a) $\lambda = 10$,
(b) $\lambda = 20$, and (c) $\lambda = 40$.


9. Compare the binomial cdf and the normal approximation for (a) n = 20 and
p = .2, and (b) n = 40 and p = .5.


10. A six-sided die is rolled 100 times. Using the normal approximation, find the
probability that the face showing a six turns up between 15 and 20 times.
Find the probability that the sum of the face values of the 100 trials is less
than 300.


11. A skeptic gives the following argument to show that there must be a flaw in
the central limit theorem: "We know that the sum of independent Poisson random
variables follows a Poisson distribution with a parameter that is the sum
of the parameters of the summands. In particular, if n independent Poisson
random variables, each with parameter $n^{-1}$, are summed, the sum has a Poisson
distribution with parameter 1. The central limit theorem says that as n
approaches infinity, the distribution of the sum tends to a normal distribution,
but the Poisson with parameter 1 is not the normal." What do you think of this
argument?


12. The central limit theorem can be used to analyze round-off error. Suppose that
the round-off error is represented as a uniform random variable on $[-\frac{1}{2}, \frac{1}{2}]$. If
100 numbers are added, approximate the probability that the round-off error
exceeds (a) 1, (b) 2, and (c) 5.


13. A drunkard executes a "random walk" in the following way: Each minute he
takes a step north or south, with probability $\frac{1}{2}$ each, and his successive step
directions are independent. His step length is 50 cm. Use the central limit
theorem to approximate the probability distribution of his location after 1 h. Where
is he most likely to be?


14. Answer Problem 13 under the assumption that the drunkard has some idea of
where he wants to go so that he steps north with probability $\frac{2}{3}$ and south with
probability $\frac{1}{3}$.


15. Suppose that you bet $5 on each of a sequence of 50 independent fair games.
Use the central limit theorem to approximate the probability that you will lose
more than $75.


16. Suppose that $X_1, \ldots, X_{20}$ are independent random variables with density
functions

$$f(x) = 2x, \qquad 0 \le x \le 1$$

Let $S = X_1 + \cdots + X_{20}$. Use the central limit theorem to approximate
$P(S \le 10)$.


17. Suppose that a measurement has mean $\mu$ and variance $\sigma^2 = 25$. Let $\bar{X}$ be the
average of n such independent measurements. How large should n be so that
$P(|\bar{X} - \mu| < 1) = .95$?


18. Suppose that a company ships packages that are variable in weight, with an
average weight of 15 lb and a standard deviation of 10. Assuming that the packages
come from a large number of different customers so that it is reasonable to
model their weights as independent random variables, find the probability that
100 packages will have a total weight exceeding 1700 lb.


19. a. Use the Monte Carlo method with n = 100 and n = 1000 to estimate
$\int_0^1 \cos(2\pi x)\, dx$. Compare the estimates to the exact answer.

b. Use Monte Carlo to evaluate $\int_0^1 \cos(2\pi x^2)\, dx$. Can you find the exact
answer?


20. What is the variance of the estimate of an integral by the Monte Carlo method
(Example A of Section 5.2)? [Hint: Find $E(\hat{I}^2(f))$.] Compare the standard
deviations of the estimates of part (a) of the previous problem to the actual errors
you made.


21. This problem introduces a variation on the Monte Carlo integration technique
of Example A of Section 5.2. Suppose that we wish to evaluate

$$I(f) = \int_a^b f(x)\, dx$$

Let g be a density function on [a, b]. Generate $X_1, \ldots, X_n$ from g and estimate
I by

$$\hat{I}(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)}$$

a. Show that $E(\hat{I}(f)) = I(f)$.

b. Find an expression for $\text{Var}(\hat{I}(f))$. Give an example for which it is finite and
an example for which it is infinite. Note that if it is finite, the law of large
numbers implies that $\hat{I}(f) \to I(f)$ as $n \to \infty$.

c. Show that if a = 0, b = 1, and g is uniform, this is the same Monte Carlo
estimate as that of Example A of Section 5.2. Can this estimate be improved
by choosing g to be other than uniform? (Hint: Compare variances.)


22. Use the central limit theorem to find $\Delta$ such that $P(|\hat{I}(f) - I(f)| \le \Delta) =
.05$, where $\hat{I}(f)$ is the Monte Carlo estimate of $\int_0^1 \cos(2\pi x)\, dx$ based on
1000 points.


23. An irregularly shaped object of unknown area A is located in the unit square
$0 \le x \le 1$, $0 \le y \le 1$. Consider a random point distributed uniformly over the
square; let Z = 1 if the point lies inside the object and Z = 0 otherwise. Show
that E(Z) = A. How could A be estimated from a sequence of n independent
points uniformly distributed on the square?


24. How could the central limit theorem be used to gauge the probable size of the
error of the estimate of the previous problem? Denoting the estimate by $\hat{A}$, if
A = .2, how large should n be so that $P(|\hat{A} - A| < .01) \approx .99$?


25. Let X be a continuous random variable with density function $f(x) = \frac{3}{2} x^2$,
$-1 \le x \le 1$. Sketch this density function. Use the central limit theorem to sketch
the approximate density function of $S = X_1 + \cdots + X_{50}$, where the $X_i$ are
independent random variables with density f. Similarly, sketch the approximate
density functions of $S/50$ and $S/\sqrt{50}$. For each sketch, label at least three
points on the horizontal axis.


26. Suppose that a basketball player can score on a particular shot with
probability .3. Use the central limit theorem to find the approximate distribution of S,
the number of successes out of 25 independent shots. Find the approximate
probabilities that S is less than or equal to 5, 7, 9, and 11 and compare these to
the exact probabilities.


27. Prove that if $a_n \to a$, then $(1 + a_n/n)^n \to e^a$.


28. Let $f_n$ be a sequence of frequency functions with $f_n(x) = \frac{1}{2}$ if $x = \pm(\frac{1}{2})^n$
and $f_n(x) = 0$ otherwise. Show that $\lim f_n(x) = 0$ for all x, which means that
the frequency functions do not converge to a frequency function, but that there
exists a cdf F such that $\lim F_n(x) = F(x)$.


29. In addition to limit theorems that deal with sums, there are limit theorems that
deal with extreme values such as maxima or minima. Here is an example. Let
$U_1, \ldots, U_n$ be independent uniform random variables on [0, 1], and let $U_{(n)}$ be
the maximum. Find the cdf of $U_{(n)}$ and a standardized $U_{(n)}$, and show that the
cdf of the standardized variable tends to a limiting value.


30. Generate a sequence $U_1, U_2, \ldots, U_{1000}$ of independent uniform random
variables on a computer. Let $S_n = \sum_{i=1}^{n} U_i$ for n = 1, 2, ..., 1000. Plot each of
the following versus n:

a. $S_n$

b. $S_n/n$

c. $S_n - n/2$

d. $(S_n - n/2)/n$

e. $(S_n - n/2)/\sqrt{n}$

Explain the shapes of the resulting graphs using the concepts of this chapter.


CHAPTER 6 


Distributions Derived 
from the Normal 
Distribution 


6.1 Introduction 


This chapter assembles some results concerning three probability distributions derived
from the normal distribution: the $\chi^2$, t, and F distributions. These distributions occur
in many statistical problems and will be used in later chapters.


6.2 $\chi^2$, t, and F Distributions


DEFINITION
If Z is a standard normal random variable, the distribution of $U = Z^2$ is called
the chi-square distribution with 1 degree of freedom.


We have already encountered the chi-square distribution in Section 2.3, where
we saw that it is a special case of the gamma distribution with parameters $\frac{1}{2}$ and $\frac{1}{2}$.
The chi-square distribution with 1 degree of freedom is denoted $\chi^2_1$. It is useful to note
that if $X \sim N(\mu, \sigma^2)$, then $(X - \mu)/\sigma \sim N(0, 1)$, and therefore $[(X - \mu)/\sigma]^2 \sim \chi^2_1$.


DEFINITION
If $U_1, U_2, \ldots, U_n$ are independent chi-square random variables with 1 degree of
freedom, the distribution of $V = U_1 + U_2 + \cdots + U_n$ is called the chi-square
distribution with n degrees of freedom and is denoted by $\chi^2_n$.


From Example F in Section 4.5, we know that the sum of independent gamma
random variables that have the same value of $\lambda$ follows a gamma distribution, and
therefore the chi-square distribution with n degrees of freedom is a gamma distribution
with $\alpha = n/2$ and $\lambda = \frac{1}{2}$. Its density is

$$f(v) = \frac{1}{2^{n/2}\, \Gamma(n/2)}\, v^{(n/2)-1} e^{-v/2}, \qquad v \ge 0$$

Its moment-generating function is

$$M(t) = (1 - 2t)^{-n/2}$$

Also, E(V) = n and Var(V) = 2n. To indicate that V follows a chi-square distribution
with n degrees of freedom, we write $V \sim \chi^2_n$. A notable consequence of the definition
of the chi-square distribution is that if U and V are independent and $U \sim \chi^2_n$ and
$V \sim \chi^2_m$, then $U + V \sim \chi^2_{m+n}$.
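The definition suggests a direct way to generate chi-square random variables: sum the squares of independent standard normals. The sketch below is a simulation check of E(V) = n and Var(V) = 2n under that construction; the function name `chi_square_draw`, the seed, and the number of replications are arbitrary choices made here.

```python
import random

def chi_square_draw(n, rng):
    """One draw from chi-square with n degrees of freedom, built from the definition."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(2)
n, reps = 5, 20000
draws = [chi_square_draw(n, rng) for _ in range(reps)]
mean = sum(draws) / reps
var = sum((v - mean) ** 2 for v in draws) / (reps - 1)
print(mean, var)  # should be near n = 5 and 2n = 10
```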

We now turn to the t distribution.


DEFINITION
If $Z \sim N(0, 1)$ and $U \sim \chi^2_n$ and Z and U are independent, then the distribution
of $Z/\sqrt{U/n}$ is called the t distribution with n degrees of freedom.


PROPOSITION A
The density function of the t distribution with n degrees of freedom is

$$f(t) = \frac{\Gamma[(n+1)/2]}{\sqrt{n\pi}\, \Gamma(n/2)} \left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}$$


Proof
This is proved by a standard method. The density function of $\sqrt{U/n}$ is
straightforward to obtain, and the density function of the quotient of two independent
random variables was derived in Section 3.6.1. The details of the proof are left
as an end-of-chapter problem.


From the density function of Proposition A, $f(t) = f(-t)$, so the t distribution
is symmetric about zero. As the number of degrees of freedom approaches infinity,
the t distribution tends to the standard normal distribution; in fact, for more than 20
or 30 degrees of freedom, the distributions are very close. Figure 6.1 shows several t
densities. Note that the tails become lighter as the degrees of freedom increase.


FIGURE 6.1 Three t densities with 5 (long dashes), 10 (short dashes), and 30 (dots)
degrees of freedom and the standard normal density (solid line).
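The density of Proposition A is easy to evaluate, which gives a quick numerical view of the lighter tails. The sketch below is illustrative; the function names are chosen here, and the log-gamma function is used only to avoid overflow for large degrees of freedom.

```python
import math

def t_density(t, n):
    """Density of the t distribution with n degrees of freedom (Proposition A)."""
    log_const = (math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
                 - 0.5 * math.log(n * math.pi))
    return math.exp(log_const - (n + 1) / 2 * math.log1p(t * t / n))

def normal_density(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# Tail heights at t = 3 for the degrees of freedom shown in Figure 6.1.
for df in (5, 10, 30):
    print(df, t_density(3.0, df))
print("normal", normal_density(3.0))
```

The printed values decrease as the degrees of freedom increase and approach the normal value, in line with the remark about tails.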


DEFINITION
Let U and V be independent chi-square random variables with m and n degrees
of freedom, respectively. The distribution of

$$W = \frac{U/m}{V/n}$$

is called the F distribution with m and n degrees of freedom and is denoted by
$F_{m,n}$.


PROPOSITION B
The density function of W is given by

$$f(w) = \frac{\Gamma[(m+n)/2]}{\Gamma(m/2)\, \Gamma(n/2)} \left(\frac{m}{n}\right)^{m/2} w^{m/2-1} \left(1 + \frac{m}{n} w\right)^{-(m+n)/2}, \qquad w \ge 0$$


Proof
W is the ratio of two independent random variables, and its density follows from
the results given in Section 3.6.1.

It can be shown that, for n > 2, E(W) exists and equals n/(n − 2). From the
definitions of the t and F distributions, it follows that the square of a $t_n$ random
variable follows an $F_{1,n}$ distribution (see Problem 6 at the end of this chapter).
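The value E(W) = n/(n − 2) can be checked by simulating W from its definition. In the sketch below, the chi-square variables are drawn as gamma variables with shape df/2 and scale 2, using the gamma representation noted in Section 6.2; the degrees of freedom, seed, and replication count are arbitrary choices made here.

```python
import random

def chi_square_draw(df, rng):
    # A chi-square variable with df degrees of freedom is gamma with
    # shape df/2 and scale 2 (rate 1/2).
    return rng.gammavariate(df / 2.0, 2.0)

def f_draw(m, n, rng):
    """One draw of W = (U/m) / (V/n) with independent chi-square U and V."""
    return (chi_square_draw(m, rng) / m) / (chi_square_draw(n, rng) / n)

rng = random.Random(3)
m, n, reps = 5, 12, 20000
draws = [f_draw(m, n, rng) for _ in range(reps)]
mean = sum(draws) / reps
print(mean, n / (n - 2))  # simulated mean and the theoretical value 1.2
```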


6.3 The Sample Mean and the Sample Variance


Let $X_1, \ldots, X_n$ be independent $N(\mu, \sigma^2)$ random variables; we sometimes refer to
them as a sample from a normal distribution. In this section, we will find the joint
and marginal distributions of

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

These are called the sample mean and the sample variance, respectively. First note
that because $\bar{X}$ is a linear combination of independent normal random variables, it is
normally distributed with

$$E(\bar{X}) = \mu$$

$$\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$$


As a preliminary to showing that $\bar{X}$ and $S^2$ are independently distributed, we
establish the following theorem.


THEOREM A
The random variable $\bar{X}$ and the vector of random variables
$(X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_n - \bar{X})$ are independent.


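Proof
At the level of this course, it is difficult to give a proof that provides sufficient
insight into why this result is true; a rigorous proof essentially depends on
geometric properties of the multivariate normal distribution, which this book does not
cover. We present a proof based on moment-generating functions; in particular,
we will show that the joint moment-generating function

$$M(s, t_1, \ldots, t_n) = E\{\exp[s\bar{X} + t_1(X_1 - \bar{X}) + \cdots + t_n(X_n - \bar{X})]\}$$

factors into the product of two moment-generating functions, one of $\bar{X}$ and the
other of $(X_1 - \bar{X}), \ldots, (X_n - \bar{X})$. The factoring implies (Section 4.5) that the
random variables are independent of each other and is accomplished through
some algebraic trickery. First we observe that since

$$\sum_{i=1}^{n} t_i (X_i - \bar{X}) = \sum_{i=1}^{n} t_i X_i - n\bar{X}\bar{t}$$

then

$$s\bar{X} + \sum_{i=1}^{n} t_i (X_i - \bar{X}) = \sum_{i=1}^{n} \left[\frac{s}{n} + (t_i - \bar{t})\right] X_i = \sum_{i=1}^{n} a_i X_i$$

where

$$a_i = \frac{s}{n} + (t_i - \bar{t})$$

Furthermore, we observe that

$$\sum_{i=1}^{n} a_i = s$$

$$\sum_{i=1}^{n} a_i^2 = \frac{s^2}{n} + \sum_{i=1}^{n} (t_i - \bar{t})^2$$

Now we have

$$M(s, t_1, \ldots, t_n) = M_{X_1 \cdots X_n}(a_1, \ldots, a_n)$$

and since the $X_i$ are independent normal random variables, we have

$$M(s, t_1, \ldots, t_n) = \prod_{i=1}^{n} M_{X_i}(a_i)$$

$$= \prod_{i=1}^{n} \exp\left(\mu a_i + \frac{\sigma^2}{2} a_i^2\right)$$

$$= \exp\left(\mu \sum_{i=1}^{n} a_i + \frac{\sigma^2}{2} \sum_{i=1}^{n} a_i^2\right)$$

$$= \exp\left[\mu s + \frac{\sigma^2}{2}\left(\frac{s^2}{n}\right) + \frac{\sigma^2}{2} \sum_{i=1}^{n} (t_i - \bar{t})^2\right]$$

$$= \exp\left(\mu s + \frac{\sigma^2}{2n} s^2\right) \exp\left[\frac{\sigma^2}{2} \sum_{i=1}^{n} (t_i - \bar{t})^2\right]$$

The first factor is the mgf of $\bar{X}$. Since the mgf of the vector $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$
can be obtained by setting s = 0 in M, the second factor is this mgf.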
COROLLARY A
$\bar{X}$ and $S^2$ are independently distributed.


Proof
This follows immediately since $S^2$ is a function of the vector $(X_1 - \bar{X}, \ldots,
X_n - \bar{X})$, which is independent of $\bar{X}$.
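A simulation cannot prove independence, but it can show one of its consequences: for normal samples, the sample mean and the sample variance should be essentially uncorrelated. The sketch below assumes samples of size 10 from an N(5, 4) distribution; these values, the seed, and the helper names are illustrative choices, not taken from the text.

```python
import random

def mean_and_variance(xs):
    """Return the sample mean and the sample variance (divisor n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return xbar, s2

def correlation(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    k = len(a)
    ma, mb = sum(a) / k, sum(b) / k
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(4)
n, reps = 10, 20000
pairs = [mean_and_variance([rng.gauss(5.0, 2.0) for _ in range(n)])
         for _ in range(reps)]
xbars = [p[0] for p in pairs]
s2s = [p[1] for p in pairs]
r = correlation(xbars, s2s)
print(r)  # should be close to zero
```

Repeating the experiment with a skewed distribution in place of the normal would typically give a clearly nonzero correlation, which is one way to see that the result depends on normality.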


The next theorem gives the marginal distribution of $S^2$.


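THEOREM B
The distribution of $(n - 1)S^2/\sigma^2$ is the chi-square distribution with n − 1 degrees
of freedom.


Proof
We first note that

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2_n$$

Also,

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2 = \frac{1}{\sigma^2} \sum_{i=1}^{n} [(X_i - \bar{X}) + (\bar{X} - \mu)]^2$$

Expanding the square and using the fact that $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$, we obtain

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2 = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X})^2 + \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2$$

This is a relation of the form W = U + V. Since U and V are independent
by Corollary A, $M_W(t) = M_U(t) M_V(t)$. W and V follow chi-square distributions
with n and 1 degrees of freedom, respectively, so

$$M_U(t) = \frac{M_W(t)}{M_V(t)} = \frac{(1 - 2t)^{-n/2}}{(1 - 2t)^{-1/2}} = (1 - 2t)^{-(n-1)/2}$$

The last expression is the mgf of a random variable with a $\chi^2_{n-1}$ distribution.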
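Theorem B implies that $(n - 1)S^2/\sigma^2$ has mean n − 1 and variance 2(n − 1). The following sketch checks both by simulation, assuming samples of size 8 from a normal distribution with mean 5 and standard deviation 3; those values and the function name are assumptions made for illustration.

```python
import random

def scaled_sample_variance(n, mu, sigma, rng):
    """Return (n - 1) S^2 / sigma^2 for one normal sample of size n."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / sigma ** 2

rng = random.Random(5)
n, reps = 8, 20000
vals = [scaled_sample_variance(n, mu=5.0, sigma=3.0, rng=rng) for _ in range(reps)]
mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / (reps - 1)
print(mean, var)  # compare with n - 1 = 7 and 2(n - 1) = 14
```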


One final result concludes this chapter's collection. 
COROLLARY B
Let $\bar{X}$ and $S^2$ be as given at the beginning of this section. Then

$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$


Proof
We simply express the given ratio in a different form:

$$\frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\left(\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)}{\sqrt{S^2/\sigma^2}}$$

The latter is the ratio of an N(0, 1) random variable to the square root of an
independent random variable with a $\chi^2_{n-1}$ distribution divided by its degrees of
freedom. Thus, from the definition in Section 6.2, the ratio follows a t distribution
with n − 1 degrees of freedom.


6.4 Problems


1. Prove Proposition A of Section 6.2.


2. Prove Proposition B of Section 6.2.


3. Let $\bar{X}$ be the average of a sample of 16 independent normal random variables
with mean 0 and variance 1. Determine c such that

$$P(|\bar{X}| < c) = .5$$


4. If T follows a $t_7$ distribution, find $t_0$ such that (a) $P(|T| < t_0) = .9$ and
(b) $P(T > t_0) = .05$.


5. Show that if $X \sim F_{n,m}$, then $X^{-1} \sim F_{m,n}$.


6. Show that if $T \sim t_n$, then $T^2 \sim F_{1,n}$.


7. Show that the Cauchy distribution and the t distribution with 1 degree of
freedom are the same.


8. Show that if X and Y are independent exponential random variables with $\lambda = 1$,
then X/Y follows an F distribution. Also, identify the degrees of freedom.


9. Find the mean and variance of $S^2$, where $S^2$ is as in Section 6.3.


10. Show how to use the chi-square distribution to calculate $P(a < S^2/\sigma^2 < b)$.


11. Let $X_1, \ldots, X_n$ be a sample from an $N(\mu_X, \sigma^2)$ distribution and $Y_1, \ldots, Y_m$ be
an independent sample from an $N(\mu_Y, \sigma^2)$ distribution. Show how to use the
F distribution to find $P(S_X^2/S_Y^2 > c)$.


CHAPTER 7


Survey Sampling


7.1 Introduction


Resting on the probabilistic foundations of the preceding chapters, this chapter marks 
the beginning of our study of statistics by introducing the subject of survey sampling. 
As well as being of considerable intrinsic interest and practical utility, the development 
of the elementary theory of survey sampling serves to introduce several concepts and 
techniques that will recur and be amplified in later chapters. 

Sample surveys are used to obtain information about a large population by exam- 
ining only a small fraction of that population. Sampling techniques have been used 
in many fields, such as the following: 


Governments survey human populations; for example, the U.S. government con- 
ducts health surveys and census surveys. 

Sampling techniques have been extensively employed in agriculture to estimate 
such quantities as the total acreage of wheat in a state by surveying a sample of 
farms. 

The Interstate Commerce Commission has carried out sampling studies of rail and 
highway traffic. In one such study, records of shipments of household goods by 
motor carriers were sampled to evaluate the accuracy of preshipment estimates of 
charges, claims for damages, and other variables. 

In the practice of quality control, the output of a manufacturing process may be 
sampled in order to examine the items for defects. 

During audits of the financial records of large companies, sampling techniques may 
be used when examination of the entire set of records is impractical. 


The sampling techniques discussed here are probabilistic in nature—each mem- 
ber of the population has a specified probability of being included in the sample, and 
the actual composition of the sample is random. Such techniques differ markedly from 
the type of sampling scheme in which particular population members are included
in the sample because the investigator thinks they are typical in some way. Such a 
scheme may be effective in some situations, but there is no way mathematically to 
guarantee its unbiasedness (a term that will be precisely defined later) or to estimate 
the magnitude of any error committed, such as that arising from estimating the popu- 
lation mean by the sample mean. We will see that using a random sampling technique 
has a consequence that estimates can be guaranteed to be unbiased and probabilistic 
bounds on errors can be calculated. Among the advantages of using random sampling 
are the following: 


The selection of sample units at random is a guard against investigator biases, even 
unconscious ones. 

A small sample costs far less and is much faster to survey than a complete enumer- 
ation. 

The results from a small sample may actually be more accurate than those from a 
complete enumeration. The quality of the data in a small sample can be more easily 
monitored and controlled, and a complete enumeration may require a much larger, 
and therefore perhaps more poorly trained, staff. 

Random sampling techniques make possible the calculation of an estimate of the 
error due to sampling. 

In designing a sample, it is frequently possible to determine the sample size neces- 
sary to obtain a prescribed error level. 


Peck et al. (2005) contains several interesting papers about applications of 
sampling. 


7.2 Population Parameters


This section defines those numerical characteristics, or parameters, of the population
that we will estimate from a sample. We will assume that the population is of size
N and that associated with each member of the population is a numerical value of
interest. These numerical values will be denoted by $x_1, x_2, \ldots, x_N$. The variable $x_i$
may be a numerical variable such as age or weight, or it may take on the value 1 or
0 to denote the presence or absence of some characteristic such as gender. We will
refer to the latter situation as the dichotomous case.


This is the first of many examples in this chapter in which we will illustrate ideas
by using a study by Herkson (1976). The population consists of N = 393 short-stay
hospitals. We will let $x_i$ denote the number of patients discharged from the ith
hospital during January 1968. A histogram of the population values is shown in
Figure 7.1. The histogram was constructed in the following way: The numbers of hospitals
that discharged 0-200, 201-400, ..., 2801-3000 patients were graphed as horizontal
lines above the respective intervals. For example, the figure indicates that about
40 hospitals discharged from 601 to 800 patients. The histogram is a convenient
graphical representation of the distribution of the values in the population, being
more quickly assimilated than a list of 393 values would be.


FIGURE 7.1 Histogram of the numbers of patients discharged during January 1968
from 393 short-stay hospitals.


We will be particularly interested in the population mean, or average,

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

For the population of 393 hospitals, the mean number of discharges is 814.6. Note
the location of this value in Figure 7.1. In the dichotomous case, where the presence
or absence of a characteristic is to be determined, $\mu$ equals the proportion, p, of
individuals in the population having the particular characteristic, because in the sum
above, each $x_i$ is either 0 or 1. The sum thus reduces to the number of 1s and, when
divided by N, gives the proportion, p.

The population total is

$$\tau = \sum_{i=1}^{N} x_i = N\mu$$

The total number of people discharged from the population of hospitals is $\tau =
320{,}138$. In the dichotomous case, the population total is the total number of members
of the population possessing the characteristic of interest.
We will also need to consider the population variance,

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

A useful identity can be obtained by expanding the square in this equation:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i^2 - 2x_i\mu + \mu^2)$$

$$= \frac{1}{N} \left(\sum_{i=1}^{N} x_i^2 - 2N\mu^2 + N\mu^2\right)$$

$$= \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$$

In the dichotomous case, the population variance reduces to p(1 − p):

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - p^2$$

$$= p - p^2$$

$$= p(1 - p)$$

Here we used the fact that because each $x_i$ is 0 or 1, each $x_i^2$ is also 0 or 1 and equals $x_i$.

The population standard deviation is the square root of the population variance
and is used as a measure of how spread out, dispersed, or scattered the individual values
are. The standard deviation is given in the same units (for example, inches) as are the
population values, whereas the variance is given in those units squared. The variance
of the discharges is 347,766, and the standard deviation is 589.7; examination of
the histogram in Figure 7.1 makes it clear that the latter number is the more reasonable
description of the spread of the population values.
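The population parameters and the variance identity can be illustrated on a tiny population. The sketch below uses hypothetical values (the hospital data are not reproduced here) and exact rational arithmetic so that the identity holds exactly and not just approximately; the function name `population_parameters` is chosen for illustration.

```python
from fractions import Fraction

def population_parameters(xs):
    """Return (mu, tau, sigma^2) for a finite population, using exact arithmetic."""
    N = len(xs)
    tau = sum(xs)
    mu = Fraction(tau, N)
    sigma2 = sum((x - mu) ** 2 for x in xs) / N
    return mu, tau, sigma2

# Hypothetical population values, for illustration only.
population = [3, 7, 7, 10, 13]
mu, tau, sigma2 = population_parameters(population)

# Identity: sigma^2 = (1/N) * sum(x_i^2) - mu^2
identity = Fraction(sum(x * x for x in population), len(population)) - mu ** 2

# Dichotomous case: the variance reduces to p(1 - p).
flags = [1, 0, 0, 1, 1, 0, 1, 1]
p, _, flag_var = population_parameters(flags)
print(mu, tau, sigma2, p, flag_var)
```

For this population the mean is 8, the total is 40, and both forms of the variance give 56/5; for the 0/1 population, p = 5/8 and the variance is 15/64.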


7.3 Simple Random Sampling


The most elementary form of sampling is simple random sampling (s.r.s.): Each
particular sample of size n has the same probability of occurrence; that is, each of the
$\binom{N}{n}$ possible samples of size n taken without replacement has the same probability.
We assume that sampling is done without replacement so that each member of the
population will appear in the sample at most once. The actual composition of the
sample is usually determined by using a table of random numbers or a random number
generator on a computer. Conceptually, we can regard the population members as
balls in an urn, a specified number of which are selected for inclusion in the sample
at random and without replacement.

Because the composition of the sample is random, the sample mean is random. 
An analysis of the accuracy with which the sample mean approximates the population 
mean must therefore be probabilistic in nature. In this section, we will derive some 
statistical properties of the sample mean. 


7.3.1 The Expectation and Variance of the Sample Mean


We will denote the sample size by n (n is less than N) and the values of the sample
members by $X_1, X_2, \ldots, X_n$. It is important to realize that each $X_i$ is a random
variable. In particular, $X_i$ is not the same as $x_i$: $X_i$ is the value of the ith member of the
sample, which is random, and $x_i$ is that of the ith member of the population, which is fixed.
We will consider the sample mean,

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

as an estimate of the population mean. As an estimate of the population total, we will
consider

$$T = N\bar{X}$$

Properties of T will follow readily from those of $\bar{X}$. Since each $X_i$ is a random
variable, so is the sample mean; its probability distribution is called its sampling
distribution. In general, any numerical value, or statistic, computed from a random
sample is a random variable and has an associated sampling distribution. The sampling
distribution of $\bar{X}$ determines how accurately $\bar{X}$ estimates $\mu$; roughly speaking, the
more tightly the sampling distribution is centered on $\mu$, the better the estimate.


EXAMPLE A
To illustrate the concept of a sampling distribution, let us look again at the population
of 393 hospitals. In practice, of course, the population would not be known, and only
one sample would be drawn. For pedagogical purposes here, we can consider the
sampling distribution of the sample mean from this known population. Suppose, for
example, that we want to find the sampling distribution of the mean of a sample of size
16. In principle, we could form all $\binom{393}{16}$ samples and compute the mean of each one;
this would give the sampling distribution. But because the number of such samples is
of the order $10^{28}$, this is clearly not practical. We will thus employ a technique known
as simulation. We can estimate the sampling distribution of the mean of a sample of
size n by drawing many samples of size n, computing the mean of each sample, and
then forming a histogram of the collection of sample means. Figure 7.2 shows the
results of such a simulation for sample sizes of 8, 16, 32, and 64 with 500 replications
for each sample size. Three features of Figure 7.2 are noteworthy:


1. All the histograms are centered about the population mean, 814.6.

2. As the sample size increases, the histograms become less spread out.

3. Although the shape of the histogram of population values (Figure 7.1) is not
symmetric about the mean, the histograms in Figure 7.2 are more nearly so.


These features will be explained quantitatively.
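The simulation described in this example can be sketched as follows. Because the hospital data are not reproduced here, the code assumes a hypothetical right-skewed population of 393 values drawn from a gamma distribution; only the population size, the sample sizes, and the 500 replications follow the example. The first line also counts the samples of size 16 exactly.

```python
import math
import random
import statistics

# Number of distinct samples of size 16 from a population of 393.
n_samples = math.comb(393, 16)
print(n_samples)

rng = random.Random(6)

# Hypothetical right-skewed population of N = 393 values; the hospital
# data themselves are not reproduced here.
population = [rng.gammavariate(2.0, 400.0) for _ in range(393)]
mu = statistics.fmean(population)

def simulate_sample_means(pop, n, reps, rng):
    """Means of `reps` simple random samples of size n, drawn without replacement."""
    return [statistics.fmean(rng.sample(pop, n)) for _ in range(reps)]

summary = {}
for n in (8, 16, 32, 64):
    means = simulate_sample_means(population, n, reps=500, rng=rng)
    # Center and spread of the simulated sampling distribution.
    summary[n] = (statistics.fmean(means), statistics.stdev(means))
    print(n, summary[n])
```

The printed centers should all sit near the population mean `mu`, and the printed spreads should shrink as n grows, which are the first two features noted in the example.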


As we have said, $\bar{X}$ is a random variable whose distribution is determined by
that of the $X_i$. We thus examine the distribution of a single sample element, $X_i$. It
should be noted that the following lemma holds whether sampling is with or without
replacement.


FIGURE 7.2 Histograms of the values of the mean number of discharges in 500
simple random samples from the population of 393 hospitals. Sample sizes: (a) n = 8,
(b) n = 16, (c) n = 32, (d) n = 64.


We need to be careful about the values that the random variable $X_i$ can assume.
The ith sample member is equally likely to be any of the N population members. If
all the population values were distinct, we would then have $P(X_i = x_j) = 1/N$.
But the population values may not be distinct (for example, in the dichotomous case
there are only two values, 0 and 1). If k members of the population have the same
value $\zeta$, then $P(X_i = \zeta) = k/N$. We use this construction in proving the following
lemma.


LEMMA A
Denote the distinct values assumed by the population members by $\zeta_1, \zeta_2, \ldots, \zeta_m$,
and denote the number of population members that have the value $\zeta_j$ by $n_j$,
j = 1, 2, ..., m. Then $X_i$ is a discrete random variable with probability mass
function

$$P(X_i = \zeta_j) = \frac{n_j}{N}$$

Also,

$$E(X_i) = \mu$$

$$\text{Var}(X_i) = \sigma^2$$


Proof
The only possible values that $X_i$ can assume are $\zeta_1, \zeta_2, \ldots, \zeta_m$. Since each
member of the population is equally likely to be the ith member of the sample, the
probability that $X_i$ assumes the value $\zeta_j$ is thus $n_j/N$. The expected value of the
random variable $X_i$ is then

$$E(X_i) = \sum_{j=1}^{m} \zeta_j P(X_i = \zeta_j) = \frac{1}{N} \sum_{j=1}^{m} n_j \zeta_j = \mu$$

The last equation follows because $n_j$ population members have the value $\zeta_j$
and the sum is thus equal to the sum of the values of all the population members.
Finally,

$$\text{Var}(X_i) = E(X_i^2) - [E(X_i)]^2$$

$$= \frac{1}{N} \sum_{j=1}^{m} n_j \zeta_j^2 - \mu^2$$

$$= \sigma^2$$

Here we have used the fact that $\sum_{i=1}^{N} x_i^2 = \sum_{j=1}^{m} n_j \zeta_j^2$ and the identity for the
population variance derived in Section 7.2.
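Lemma A can be verified exactly on a small population with repeated values. In the sketch below, the population is hypothetical and the function name `single_draw_pmf` is chosen for illustration; rational arithmetic is used so that the comparisons are exact.

```python
from collections import Counter
from fractions import Fraction

def single_draw_pmf(population):
    """P(X_i = zeta_j) = n_j / N for each distinct population value zeta_j."""
    N = len(population)
    return {value: Fraction(count, N) for value, count in Counter(population).items()}

# Hypothetical population with repeated values.
population = [2, 2, 5, 5, 5, 9]
pmf = single_draw_pmf(population)

# Mean and variance of a single sample member, computed from the pmf.
mean = sum(v * p for v, p in pmf.items())
variance = sum(v * v * p for v, p in pmf.items()) - mean ** 2

# Population mean and variance, computed directly from the definitions.
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N
print(pmf, mean, variance)
```

Here the pmf assigns probabilities 1/3, 1/2, and 1/6 to the values 2, 5, and 9, and both routes give mean 14/3 and variance 50/9.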


As a measure of the center of the sampling distribution, we will use $E(\bar{X})$. As a
measure of the dispersion of the sampling distribution about this center, we will use
the standard deviation of $\bar{X}$. The key results that will be obtained shortly are that the
sampling distribution is centered at $\mu$ and that its spread is inversely proportional to
the square root of the sample size, n. We first show that the sampling distribution is
centered at $\mu$.
THEOREM A
With simple random sampling, $E(\bar{X}) = \mu$.


Proof
Since, from Lemma A, $E(X_i) = \mu$, it follows from Theorem A in Section 4.1.2
that

$$E(\bar{X}) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \mu$$


From Theorem A, we have the following corollary.


COROLLARY A
With simple random sampling, $E(T) = \tau$.


Proof

$$E(T) = E(N\bar{X})$$

$$= N E(\bar{X})$$

$$= N\mu$$

$$= \tau$$


In the dichotomous case, $\mu = p$, and $\bar{X}$ is the proportion of the sample that
possesses the characteristic of interest. In this case, $\bar{X}$ will be denoted by $\hat{p}$. We have
shown that $E(\hat{p}) = p$.

It is important to keep in mind that $\bar{X}$ is random. The result $E(\bar{X}) = \mu$ can be
interpreted to mean that "on the average" $\bar{X} = \mu$. In general, if we wish to estimate
a population parameter, $\theta$ say, by a function $\hat{\theta}$ of the sample, $X_1, X_2, \ldots, X_n$, and
$E(\hat{\theta}) = \theta$, whatever the value of $\theta$ may be, we say that $\hat{\theta}$ is unbiased. Thus, $\bar{X}$
and T are unbiased estimates of $\mu$ and $\tau$. On average they are correct. We next
investigate how variable they are, by deriving their variances and standard deviations.
Section 4.2.1 introduced the concepts of bias and variance in the context of a model
of measurement error, and these concepts are also relevant in this new context. In
Chapter 4, it was shown that

$$\text{Mean squared error} = \text{variance} + \text{bias}^2$$

Since $\bar{X}$ and T are unbiased, their mean squared errors are equal to their variances.
We next find $\text{Var}(\bar{X})$. Since $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$, it follows from Corollary A of
Section 4.3 that

$$\text{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \text{Cov}(X_i, X_j)$$
Suppose that sampling were done with replacement. Then the $X_i$ would be
independent, and for $i \ne j$ we would have $\text{Cov}(X_i, X_j) = 0$, whereas $\text{Cov}(X_i, X_i) =
\text{Var}(X_i) = \sigma^2$. It would then follow that

$$\text{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \text{Var}(X_i) = \frac{\sigma^2}{n}$$

and that the standard deviation of $\bar{X}$, also called its standard error, would be

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

Sampling without replacement induces dependence among the $X_i$, which
complicates this simple result. However, we will see that if the sample size n is small
relative to the population size N, the dependence is weak and this simple result holds
to a good approximation.

To find the variance of the sample mean in sampling without replacement we
need to find $\text{Cov}(X_i, X_j)$ for $i \ne j$.


LEMMA B 

For simple random sampling without replacement, 
Cov(X; X;) = —o?/(N -1)  ifizj 

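Proof

Using the identity for covariance established at the beginning of Section 4.3,

$$\mathrm{Cov}(X_i, X_j) = E(X_i X_j) - E(X_i) E(X_j)$$

and

$$\begin{aligned}
E(X_i X_j) &= \sum_{k=1}^{m} \sum_{l=1}^{m} \zeta_k \zeta_l \, P(X_i = \zeta_k \text{ and } X_j = \zeta_l) \\
&= \sum_{k=1}^{m} \zeta_k P(X_i = \zeta_k) \sum_{l=1}^{m} \zeta_l \, P(X_j = \zeta_l \mid X_i = \zeta_k)
\end{aligned}$$

from the multiplication law for conditional probability. Now,

$$P(X_j = \zeta_l \mid X_i = \zeta_k) = \begin{cases} n_l/(N - 1), & \text{if } k \ne l \\ (n_l - 1)/(N - 1), & \text{if } k = l \end{cases}$$

Now if we express

$$\sum_{l=1}^{m} \zeta_l \, P(X_j = \zeta_l \mid X_i = \zeta_k) = \sum_{l=1}^{m} \zeta_l \frac{n_l}{N - 1} - \frac{\zeta_k}{N - 1}$$

the expression for $E(X_i X_j)$ becomes

$$\begin{aligned}
\sum_{k=1}^{m} \zeta_k \frac{n_k}{N} \left( \sum_{l=1}^{m} \zeta_l \frac{n_l}{N - 1} - \frac{\zeta_k}{N - 1} \right)
&= \frac{\tau^2}{N(N - 1)} - \frac{1}{N(N - 1)} \sum_{k=1}^{m} \zeta_k^2 n_k \\
&= \frac{\tau^2}{N(N - 1)} - \frac{1}{N - 1} (\mu^2 + \sigma^2) \\
&= \frac{N \mu^2}{N - 1} - \frac{1}{N - 1} (\mu^2 + \sigma^2) \\
&= \mu^2 - \frac{\sigma^2}{N - 1}
\end{aligned}$$

Finally, subtracting $E(X_i) E(X_j) = \mu^2$ from the last equation, we have

$$\mathrm{Cov}(X_i, X_j) = -\frac{\sigma^2}{N - 1}$$

for $i \ne j$. ■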
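Lemma B can be checked numerically on a tiny population. The following Python sketch is an illustration only: the five-member population and the function name `exact_cov_first_two` are assumptions made for this example, not part of the text. It enumerates every equally likely ordered pair of distinct population members, which is the joint distribution of two draws under simple random sampling without replacement, and compares the exact covariance with $-\sigma^2/(N-1)$.

```python
from itertools import permutations

def exact_cov_first_two(population):
    """Exact Cov(X_1, X_2) for two draws without replacement.

    Every ordered pair of distinct population members is equally likely,
    so expectations are plain averages over all such pairs.
    """
    N = len(population)
    pairs = list(permutations(population, 2))  # members are distinct by position
    mu = sum(population) / N
    e_product = sum(a * b for a, b in pairs) / len(pairs)
    return e_product - mu * mu

# Hypothetical population (note the repeated value 2).
population = [1, 2, 2, 5, 10]
N = len(population)
mu = sum(population) / N
sigma_sq = sum((x - mu) ** 2 for x in population) / N

print("exact covariance :", exact_cov_first_two(population))
print("-sigma^2 / (N-1) :", -sigma_sq / (N - 1))
```

Both printed values should agree up to floating-point rounding, and the covariance shrinks toward zero as $N$ grows with $\sigma^2$ held fixed.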


(Alternative proofs of Lemma B are outlined in Problems 25 and 26 at the end of
this chapter.) This lemma shows that $X_i$ and $X_j$ are not independent of each other for
$i \ne j$, but that the covariance is very small for large values of $N$. We are now able to
derive the following theorem.


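THEOREM B

With simple random sampling,

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \left( 1 - \frac{n - 1}{N - 1} \right)$$

Proof

From Corollary A of Section 4.3,

$$\begin{aligned}
\mathrm{Var}(\bar{X}) &= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}(X_i, X_j) \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j \ne i} \mathrm{Cov}(X_i, X_j) \\
&= \frac{\sigma^2}{n} - \frac{1}{n^2} \, n(n - 1) \, \frac{\sigma^2}{N - 1}
\end{aligned}$$

After some algebra, this gives the desired result. ■

Notice that the variance of the sample mean in sampling without replacement
differs from that in sampling with replacement by the factor

$$\left( 1 - \frac{n - 1}{N - 1} \right)$$

which is called the finite population correction. The ratio $n/N$ is called the sampling
fraction. Frequently, the sampling fraction is very small, in which case the standard
error (standard deviation) of $\bar{X}$ is

$$\sigma_{\bar{X}} \approx \frac{\sigma}{\sqrt{n}}$$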

We see that, apart from the usually small finite population correction, the spread of the
sampling distribution and therefore the precision of $\bar{X}$ are determined by the sample
size ($n$) and not by the population size ($N$). As will be made more explicit later,
the appropriate measure of the precision of the sample mean is its standard error,
which is inversely proportional to the square root of the sample size. Thus, in order
to double the accuracy, the sample size must be quadrupled. (You might examine
Figure 7.2 with this in mind.) The other factor that determines the accuracy of the
sample mean is the population standard deviation, $\sigma$. If $\sigma$ is small, the population
values are not very dispersed and a small sample will be fairly accurate. But if the
values are widely dispersed, a much larger sample will be required in order to attain
the same accuracy.


If the population of hospitals is sampled without replacement and the sample size is
$n = 32$,

$$\begin{aligned}
\sigma_{\bar{X}} &= \frac{\sigma}{\sqrt{n}} \sqrt{1 - \frac{n - 1}{N - 1}} \\
&= \frac{589.7}{\sqrt{32}} \sqrt{1 - \frac{31}{392}} \\
&= 104.2 \times .96 \\
&= 100.0
\end{aligned}$$

Notice that because the sampling fraction is small, the finite population correction
makes little difference. To see that $\sigma_{\bar{X}} = 100.0$ is a reasonable measure of accuracy,
examine part (b) of Figure 7.2 and observe that the vast majority of sample means
differed from the population mean (814) by less than two standard errors; i.e., the
vast majority of sample means were in the interval (614, 1014). ■
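The arithmetic of this example is easy to reproduce. The short Python sketch below is only an illustration (the function name `standard_error_mean` is an assumption of this example); it evaluates the standard error of the sample mean with and without the finite population correction of Theorem B, using $\sigma = 589.7$, $n = 32$, and $N = 393$.

```python
import math

def standard_error_mean(sigma, n, N=None):
    """Standard error of the sample mean.

    With N omitted, returns sigma / sqrt(n) (sampling with replacement).
    With N given, multiplies by the square root of the finite
    population correction, 1 - (n - 1) / (N - 1).
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt(1 - (n - 1) / (N - 1))
    return se

se_no_fpc = standard_error_mean(589.7, 32)           # about 104.2
se_with_fpc = standard_error_mean(589.7, 32, N=393)  # about 100.0
print(round(se_no_fpc, 1), round(se_with_fpc, 1))
```

Doubling the accuracy, that is, halving the standard error, requires calling the function with $4n$ in place of $n$ when the correction is ignored.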


Let us apply this result to the problem of estimating a proportion. In the population of
hospitals, a proportion $p = .654$ had fewer than 1000 discharges. If this proportion
were estimated from a sample as the sample proportion $\hat{p}$, the standard error of $\hat{p}$
could be found by applying Theorem B to this dichotomous case:

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}} \sqrt{1 - \frac{n - 1}{N - 1}}$$

For example, for $n = 32$, the standard error of $\hat{p}$ is

$$\sigma_{\hat{p}} = \sqrt{\frac{.654 \times .346}{32}} \sqrt{1 - \frac{31}{392}} = .08$$

■














The precision of the estimate of the population total does depend on the population 
size, N. 


COROLLARY B

With simple random sampling,

$$\mathrm{Var}(T) = N^2 \left( \frac{\sigma^2}{n} \right) \frac{N - n}{N - 1}$$

Proof

Since $T = N\bar{X}$,

$$\mathrm{Var}(T) = N^2 \, \mathrm{Var}(\bar{X})$$

■


7.3.2 Estimation of the Population Variance 


A sample survey is used to estimate population parameters, and it is desirable also 
to assess and quantify the variability of the estimates. In the previous section, we 
saw how the standard error of an estimate may be determined from the sample size 
and the population variance. In practice, however, the population variance will not 
be known, but as we will show in this section, it can be estimated from the sample. 
Since the population variance is the average squared deviation from the population
mean, estimating it by the average squared deviation from the sample mean seems
natural:

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

The following theorem shows that this estimate is biased.


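THEOREM A

With simple random sampling,

$$E(\hat{\sigma}^2) = \sigma^2 \left( \frac{n - 1}{n} \right) \frac{N}{N - 1}$$

Proof

Expanding the square and proceeding as in the identity for the population variance
in Section 7.2, we find

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2$$

Thus,

$$E(\hat{\sigma}^2) = \frac{1}{n} \sum_{i=1}^{n} E(X_i^2) - E(\bar{X}^2)$$

Now, we know that

$$E(X_i^2) = \mathrm{Var}(X_i) + [E(X_i)]^2 = \sigma^2 + \mu^2$$

Similarly, from Theorems A and B of Section 7.3.1,

$$E(\bar{X}^2) = \mathrm{Var}(\bar{X}) + [E(\bar{X})]^2 = \frac{\sigma^2}{n} \left( 1 - \frac{n - 1}{N - 1} \right) + \mu^2$$

Substituting these expressions for $E(X_i^2)$ and $E(\bar{X}^2)$ in the preceding equation
for $E(\hat{\sigma}^2)$ gives the desired result. ■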
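The bias factor in Theorem A can be verified exactly for a small case by averaging $\hat{\sigma}^2$ over every possible sample. The Python sketch below is a hedged illustration: the five-member population and the helper name `expected_sigma_hat_sq` are assumptions made for this example only.

```python
from itertools import combinations
from statistics import mean, pvariance

def expected_sigma_hat_sq(population, n):
    """Exact E(sigma_hat^2) under simple random sampling without replacement.

    sigma_hat^2 uses divisor n (statistics.pvariance), and every subset of
    n population members is equally likely, so the expectation is a plain
    average over all subsets. Indices keep tied values distinct.
    """
    index_sets = combinations(range(len(population)), n)
    return mean(pvariance([population[i] for i in idx]) for idx in index_sets)

# Hypothetical population used only for this check.
population = [1.0, 2.0, 2.0, 5.0, 10.0]
N, n = len(population), 3
sigma_sq = pvariance(population)

lhs = expected_sigma_hat_sq(population, n)
rhs = sigma_sq * (n - 1) / n * N / (N - 1)
print(lhs, rhs)
```

The two printed numbers should coincide up to rounding, and both fall below `sigma_sq`, which is the underestimation the theorem describes.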





Because $N > n$, it follows with a little algebra that

$$\frac{n - 1}{n} \, \frac{N}{N - 1} < 1$$

so that $E(\hat{\sigma}^2) < \sigma^2$; $\hat{\sigma}^2$ thus tends to underestimate $\sigma^2$. From Theorem A, we see
that an unbiased estimate of $\sigma^2$ may be obtained by multiplying $\hat{\sigma}^2$ by the factor
$n(N - 1)/[(n - 1)N]$. Thus, an unbiased estimate of $\sigma^2$ is $\frac{1}{n-1} \left( 1 - \frac{1}{N} \right) \sum_{i=1}^{n} (X_i - \bar{X})^2$.
We also have the following corollary.

COROLLARY A

An unbiased estimate of $\mathrm{Var}(\bar{X})$ is

$$s_{\bar{X}}^2 = \frac{s^2}{n} \left( 1 - \frac{n}{N} \right)$$

where

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

Proof

Since

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \left( \frac{N - n}{N - 1} \right)$$

an unbiased estimate of $\mathrm{Var}(\bar{X})$ may be obtained by substituting in an unbiased
estimate of $\sigma^2$. Algebra then yields the desired result. ■


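Similarly, an unbiased estimate of the variance of $T$, the estimator of the population
total, is

$$s_T^2 = N^2 s_{\bar{X}}^2$$

For the dichotomous case, in which each $X_i$ is 0 or 1, note that

$$\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2 = \hat{p}(1 - \hat{p})$$

Therefore,

$$s^2 = \frac{n}{n - 1} \, \hat{p}(1 - \hat{p})$$

Thus, as a special case of Corollary A, we have the following corollary.

COROLLARY B

An unbiased estimate of $\mathrm{Var}(\hat{p})$ is

$$s_{\hat{p}}^2 = \frac{\hat{p}(1 - \hat{p})}{n - 1} \left( 1 - \frac{n}{N} \right)$$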





In many cases, the sampling fraction, $n/N$, is small and may be neglected. Furthermore,
it often makes little difference whether $n - 1$ or $n$ is used as the divisor.
The quantities $s_{\bar{X}}$, $s_T$, and $s_{\hat{p}}$ are called estimated standard errors. If we knew
them, the actual standard errors, $\sigma_{\bar{X}}$, $\sigma_T$, and $\sigma_{\hat{p}}$, would be used to gauge the accuracy
of the estimates $\bar{X}$, $T$, and $\hat{p}$. If they are not known, which is the typical case, the
estimated standard errors are used in their place.


A simple random sample of 50 of the 393 hospitals was taken. From this sample,
$\bar{X} = 938.5$ (recall that, in fact, $\mu = 814.6$) and $s = 614.53$ ($\sigma = 590$). An estimate
of the variance of $\bar{X}$ is

$$s_{\bar{X}}^2 = \frac{s^2}{n} \left( 1 - \frac{n}{N} \right) = 6592$$

The estimated standard error of $\bar{X}$ is

$$s_{\bar{X}} = 81.19$$

(Note that the true value is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{1 - \frac{n - 1}{N - 1}} = 78$.) This estimated standard error
gives a rough idea of how accurate the value of $\bar{X}$ is; in this case, we see that the
magnitude of the error is of the order 80, as opposed to 8 or 800, say. In fact, the error
was 123.9, or about $1.5 \, s_{\bar{X}}$. ■


From the same sample, the estimate of the total number of discharges in the population
of hospitals is

$$T = N\bar{X} = 368{,}831$$

Recall that the true value of the population total is 320,139. The estimated standard
error of $T$ is

$$s_T = N s_{\bar{X}} = 31{,}908$$

Again, this estimated standard error can be used as a rough gauge of the estimation
error. ■
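The estimated standard errors in the two preceding examples follow directly from Corollary A. The Python sketch below is an illustration only (the function name `estimated_var_mean` is an assumption of this example); it recomputes them from $s = 614.53$, $n = 50$, and $N = 393$.

```python
import math

def estimated_var_mean(s, n, N):
    """Unbiased estimate of Var(X-bar): (s^2 / n) * (1 - n / N)."""
    return s ** 2 / n * (1 - n / N)

N, n, s = 393, 50, 614.53

var_xbar = estimated_var_mean(s, n, N)  # about 6592
s_xbar = math.sqrt(var_xbar)            # about 81.19
s_total = N * s_xbar                    # about 31,908

print(round(var_xbar), round(s_xbar, 2), round(s_total))
```

Note that the estimated variance uses the factor $1 - n/N$, whereas the true variance in Theorem B uses $1 - (n-1)/(N-1)$; the difference arises from substituting an unbiased estimate of $\sigma^2$.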


Let $p$ be the proportion of hospitals that had fewer than 1000 discharges; that is,
$p = .654$. In the sample of Example A, 26 of 50 hospitals had fewer than 1000
discharges, so

$$\hat{p} = \frac{26}{50} = .52$$

The variance of $\hat{p}$ is estimated by

$$s_{\hat{p}}^2 = \frac{\hat{p}(1 - \hat{p})}{n - 1} \left( 1 - \frac{n}{N} \right) = .0045$$

Thus, the estimated standard error of $\hat{p}$ is

$$s_{\hat{p}} = .067$$

Crudely, this tells us that the error of $\hat{p}$ is in the second or first decimal place, so
we are probably not so fortunate as to have an error only in the third decimal place.
In fact, the error was .134, or about $2 \times s_{\hat{p}}$. ■


These examples show how, in simple random sampling, we can not only form 
estimates of unknown population parameters, but can also gauge the likely size of the 
errors of the estimates, by estimating their standard errors from the data in the sample. 

We have covered a lot of ground, and the presence of the finite population cor- 
rection complicates the expressions we have derived. It is thus useful to summarize 
our results in the following table: 








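Population
Parameter    Estimate                                          Variance of Estimate                                                        Estimated Variance

$\mu$        $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$        $\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} \left( \frac{N - n}{N - 1} \right)$        $s_{\bar{X}}^2 = \frac{s^2}{n} \left( 1 - \frac{n}{N} \right)$

$p$          $\hat{p}$ = sample proportion                     $\sigma_{\hat{p}}^2 = \frac{p(1 - p)}{n} \left( \frac{N - n}{N - 1} \right)$        $s_{\hat{p}}^2 = \frac{\hat{p}(1 - \hat{p})}{n - 1} \left( 1 - \frac{n}{N} \right)$

$\tau$       $T = N\bar{X}$                                    $\sigma_T^2 = N^2 \sigma_{\bar{X}}^2$                                               $s_T^2 = N^2 s_{\bar{X}}^2$

$\sigma^2$   $\left( 1 - \frac{1}{N} \right) s^2$

where $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2$.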

The square roots of the entries in the third column are called standard errors, 
and the square roots of the entries in the fourth column are called estimated standard 
errors. The former depend on unknown population parameters, so the latter are used 
to gauge the accuracy of the parameter estimates. When the population is large relative 
to the sample size, the finite population correction can be ignored, simplifying the 
preceding expressions. 


The Normal Approximation to the Sampling Distribution of $\bar{X}$


We have found the mean and the standard deviation of the sampling distribution of $\bar{X}$.
Ideally, we would like to know the sampling distribution, since it would tell us everything
we could hope to know about the accuracy of the estimate. Without knowledge
of the population itself, however, we cannot determine the sampling distribution. In
this section, we will use the central limit theorem to deduce an approximation to
the sampling distribution: the normal, or Gaussian, distribution. This approximation
will be used to find probabilistic bounds for the estimation error.

In Section 5.3, we considered a sequence of independent and identically distributed
(i.i.d.) random variables, $X_1, X_2, \ldots$ having the common mean and variance
$\mu$ and $\sigma^2$. The sample mean of $X_1, X_2, \ldots, X_n$ is

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

This sample mean has the properties

$$E(\bar{X}_n) = \mu$$

and

$$\mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}$$

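The central limit theorem says that, for a fixed number $z$,

$$P\left( \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \le z \right) \to \Phi(z) \quad \text{as } n \to \infty$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution.

Using a more compact and suggestive notation, we have

$$P\left( \frac{\bar{X}_n - \mu}{\sigma_{\bar{X}_n}} \le z \right) \approx \Phi(z)$$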

The context of survey sampling is not exactly like that of the central limit theorem
as stated above. As we have seen, in sampling without replacement, the $X_i$ are not
independent of each other, and it makes no sense to have $n$ tend to infinity while $N$
remains fixed. But other central limit theorems have been proved that are appropriate
to the sampling context. These show that if $n$ is large, but still small relative to $N$,
then $\bar{X}_n$, the mean of a simple random sample, is approximately normally distributed.

To demonstrate the use of the central limit theorem, we will apply it to approximate
$P(|\bar{X} - \mu| \le \delta)$, the probability that the error made in estimating $\mu$ by $\bar{X}$ is
less than some constant $\delta$:

$$\begin{aligned}
P(|\bar{X} - \mu| \le \delta) &= P(-\delta \le \bar{X} - \mu \le \delta) \\
&= P\left( -\frac{\delta}{\sigma_{\bar{X}}} \le \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \le \frac{\delta}{\sigma_{\bar{X}}} \right) \\
&\approx \Phi\left( \frac{\delta}{\sigma_{\bar{X}}} \right) - \Phi\left( -\frac{\delta}{\sigma_{\bar{X}}} \right) \\
&= 2\Phi\left( \frac{\delta}{\sigma_{\bar{X}}} \right) - 1
\end{aligned}$$

since $\Phi(-z) = 1 - \Phi(z)$, from the symmetry of the standard normal distribution
about zero.


Let us again consider the population of 393 hospitals. The standard deviation of the
mean of a sample of size $n = 64$ is, using the finite population correction,

$$\begin{aligned}
\sigma_{\bar{X}} &= \frac{\sigma}{\sqrt{n}} \sqrt{1 - \frac{n - 1}{N - 1}} \\
&= \frac{589.7}{8} \sqrt{1 - \frac{63}{392}} = 67.5
\end{aligned}$$


We can use the central limit theorem to approximate the probability that the
sample mean differs from the population mean by more than 100 in absolute value; i.e.,
$P(|\bar{X} - \mu| > 100)$. First, from the symmetry of the normal distribution,

$$P(|\bar{X} - \mu| > 100) \approx 2 P(\bar{X} - \mu > 100)$$

and

$$\begin{aligned}
P(\bar{X} - \mu > 100) &= 1 - P(\bar{X} - \mu < 100) \\
&= 1 - P\left( \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} < \frac{100}{\sigma_{\bar{X}}} \right) \\
&\approx 1 - \Phi\left( \frac{100}{67.5} \right) \\
&= .069
\end{aligned}$$

Thus the probability that the sample mean differs from the population mean by more
than 100 is approximately .14. In fact, among the 500 samples of size 64 in Example
A in Section 7.3.1, 82, or 16.4%, differed by more than 100 from the population mean.
Similarly, the central limit theorem approximation gives .026 as the probability of
deviations of more than 150 from the population mean. In the simulation in Example
A in Section 7.3.1, 11 of 500, or 2.2%, differed by more than 150. If we are not too
finicky, the central limit theorem gives us reasonable and useful approximations.
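These normal-approximation probabilities can be computed without tables. The Python sketch below is an illustration, not part of the original example; the function names `phi` and `prob_error_exceeds` are assumptions. It builds $\Phi$ from the error function in the standard library and evaluates $P(|\bar{X} - \mu| > \delta) \approx 2[1 - \Phi(\delta / \sigma_{\bar{X}})]$ for $\sigma_{\bar{X}} = 67.5$.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_error_exceeds(delta, se):
    """Normal approximation to P(|X-bar - mu| > delta) for standard error se."""
    return 2.0 * (1.0 - phi(delta / se))

se_64 = 67.5  # standard error of the mean for n = 64, from the example above

p_100 = prob_error_exceeds(100, se_64)  # roughly .14
p_150 = prob_error_exceeds(150, se_64)  # roughly .026
print(round(p_100, 2), round(p_150, 3))
```

Comparing these values with the simulated frequencies of 16.4% and 2.2% gives a sense of how rough the approximation is for samples of size 64.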


For a sample of size 50, the standard error of the sample mean number of discharges
is

$$\sigma_{\bar{X}} = 78$$

For the particular sample of size 50 discussed in Example A in Section 7.3.2, we
found $\bar{X} = 938.5$, so $\bar{X} - \mu = 123.9$. We now calculate an approximation of the
probability of an error this large or larger:

$$\begin{aligned}
P(|\bar{X} - \mu| \ge 123.9) &= 1 - P(|\bar{X} - \mu| < 123.9) \\
&\approx 1 - \left[ 2\Phi\left( \frac{123.9}{78} \right) - 1 \right] \\
&= 2 - 2\Phi(1.59) \\
&= .11
\end{aligned}$$

Thus, we can expect an error this large or larger to occur about 11% of the time. ■

In Example C in Section 7.3.2, we found from the sample of size 50 an estimate
$\hat{p} = .52$ of the proportion of hospitals that discharged fewer than 1000 patients; in
fact, the actual proportion in the population is .65. Thus, $|\hat{p} - p| = .13$. What is the
probability that an estimate will be off by an amount this large or larger?

We have

$$\begin{aligned}
\sigma_{\hat{p}} &= \sqrt{\frac{p(1 - p)}{n}} \sqrt{1 - \frac{n - 1}{N - 1}} \\
&= .068 \times .94 = .064
\end{aligned}$$

We can therefore calculate

$$\begin{aligned}
P(|\hat{p} - p| > .13) &= 1 - P(|\hat{p} - p| \le .13) \\
&= 1 - P\left( \frac{|\hat{p} - p|}{\sigma_{\hat{p}}} \le \frac{.13}{\sigma_{\hat{p}}} \right) \\
&\approx 2[1 - \Phi(2.03)] = .04
\end{aligned}$$

We see that the sample was rather "unlucky": an error this large or larger would
occur only about 4% of the time. ■


We can now derive a confidence interval for the population mean, $\mu$. A confidence
interval for a population parameter, $\theta$, is a random interval, calculated from the
sample, that contains $\theta$ with some specified probability. For example, a 95% confidence
interval for $\mu$ is a random interval that contains $\mu$ with probability .95; if we
were to take many random samples and form a confidence interval from each one,
about 95% of these intervals would contain $\mu$. If the coverage probability is $1 - \alpha$,
the interval is called a $100(1 - \alpha)\%$ confidence interval. Confidence intervals are
frequently used in conjunction with point estimates to convey information about the
uncertainty of the estimates.

For $0 \le \alpha \le 1$, let $z(\alpha)$ be that number such that the area under the standard
normal density function to the right of $z(\alpha)$ is $\alpha$ (Figure 7.3). Note that the symmetry
of the standard normal density function about zero implies that $z(1 - \alpha) = -z(\alpha)$.
If $Z$ follows a standard normal distribution, then, by definition of $z(\alpha)$,

$$P(-z(\alpha/2) \le Z \le z(\alpha/2)) = 1 - \alpha$$

From the central limit theorem, $(\bar{X} - \mu)/\sigma_{\bar{X}}$ has approximately a standard normal
distribution, so

$$P\left( -z(\alpha/2) \le \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \le z(\alpha/2) \right) \approx 1 - \alpha$$


FIGURE 7.3 A standard normal density showing $\alpha$ and $z(\alpha)$.

Elementary manipulation of the inequalities gives

$$P(\bar{X} - z(\alpha/2)\sigma_{\bar{X}} \le \mu \le \bar{X} + z(\alpha/2)\sigma_{\bar{X}}) \approx 1 - \alpha$$

That is, the probability that $\mu$ lies in the interval $\bar{X} \pm z(\alpha/2)\sigma_{\bar{X}}$ is approximately
$1 - \alpha$. The interval is thus called a $100(1 - \alpha)\%$ confidence interval. It is important
to understand that this interval is random and that the preceding equation states that
the probability that this random interval covers $\mu$ is $1 - \alpha$. In practice, $\alpha$ is assigned a
small value, such as .1, .05, or .01, so that the probability that the interval covers $\mu$ will
be large. Also, since the population variance is typically not known, $s_{\bar{X}}$ is substituted
for $\sigma_{\bar{X}}$. For large samples, it can be shown that the effect of this substitution is
practically negligible. It is impossible to give a precise answer to the question "How
large is large?" As a rule of thumb, a value of $n$ greater than 25 or 30 is usually
adequate.

To illustrate the concept of a confidence interval, 20 samples each of size $n = 25$
were drawn from the population of hospital discharges. From each of these 20 samples,
an approximate 95% confidence interval for $\mu$, the mean number of discharges, was
computed. These 20 confidence intervals are displayed as vertical lines in Figure 7.4;
the dashed line in the figure is drawn at the true value, $\mu = 814.6$. Notice that it so
happened that all the confidence intervals included $\mu$; since these are 95% intervals,
on the average 5%, or 1 out of 20, would not include $\mu$.

FIGURE 7.4 Vertical lines are 20 approximate 95% confidence intervals for $\mu$. The
horizontal line is the true value of $\mu$.

The following example illustrates the procedure for calculating confidence 
intervals. 


A particular area contains 8000 condominium units. In a survey of the occupants, a
simple random sample of size 100 yields the information that the average number of
motor vehicles per unit is 1.6, with a sample standard deviation of .8. The estimated
standard error of $\bar{X}$ is thus

$$\begin{aligned}
s_{\bar{X}} &= \frac{s}{\sqrt{n}} \sqrt{1 - \frac{n}{N}} \\
&= \frac{.8}{10} \sqrt{1 - \frac{100}{8000}} \\
&= .08
\end{aligned}$$

Note that the finite population correction makes almost no difference. Since $z(.025) =
1.96$, a 95% confidence interval for the population average is $\bar{X} \pm 1.96 s_{\bar{X}}$, or (1.44,
1.76).

An estimate of the total number of motor vehicles is $T = 8000 \times 1.6 = 12{,}800$.
The estimated standard error of $T$ is

$$s_T = N s_{\bar{X}} = 640$$

A 95% confidence interval for the total number of motor vehicles is $T \pm 1.96 s_T$, or
(11,546, 14,054).

In the same survey, 12% of the respondents said they planned to sell their condos
within the next year; $\hat{p} = .12$ is an estimate of the population proportion $p$. The
estimated standard error is

$$s_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n - 1}} \sqrt{1 - \frac{n}{N}} = \sqrt{\frac{.12 \times .88}{99}} \sqrt{1 - \frac{100}{8000}} = .03$$

A 95% confidence interval for $p$ is $\hat{p} \pm 1.96 s_{\hat{p}}$, or (.06, .18).

The total number of owners planning to sell is estimated as $T = N\hat{p} = 960$. The
estimated standard error of $T$ is $s_T = N s_{\hat{p}} = 240$. A 95% confidence interval for the
number in the population planning to sell is $T \pm 1.96 s_T$, or (490, 1430). The proper
interpretation of this interval, (490, 1430), is a little subtle. We cannot state that the
probability is 0.95 that the number of owners planning to sell is between 490 and
1430, because that number is either in this interval or not. What is true is that 95% of
intervals formed in this way will contain the true number in the long run. This interval
is like one of those shown in Figure 7.4; in the long run, 95% of those intervals will
contain the true number of discharges, but in the figure any particular interval either
does or doesn't contain the true number. ■
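The interval calculations in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper name `normal_ci` is hypothetical, and the code carries unrounded standard errors through the computation, whereas the text rounds them to .08 and .03 first. The endpoints for the totals therefore differ slightly from the printed values.

```python
import math

def normal_ci(estimate, se, z=1.96):
    """Approximate confidence interval: estimate +/- z * se (z = 1.96 for 95%)."""
    return (estimate - z * se, estimate + z * se)

N, n = 8000, 100

# Mean number of vehicles per unit: X-bar = 1.6, s = .8
s_xbar = 0.8 / math.sqrt(n) * math.sqrt(1 - n / N)
ci_mean = normal_ci(1.6, s_xbar)

# Proportion planning to sell: p-hat = .12
p_hat = 0.12
s_p = math.sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - n / N))
ci_prop = normal_ci(p_hat, s_p)

# Totals are obtained by scaling the estimate and its standard error by N
ci_total_vehicles = normal_ci(N * 1.6, N * s_xbar)
ci_total_sellers = normal_ci(N * p_hat, N * s_p)

print([round(v, 2) for v in ci_mean], [round(v, 2) for v in ci_prop])
print([round(v) for v in ci_total_vehicles], [round(v) for v in ci_total_sellers])
```

Scaling by $N$ leaves the relative width of the interval unchanged, which is why the interval for the number of sellers looks wide in absolute terms.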











The width of a confidence interval is determined by the sample size $n$ and the
population standard deviation $\sigma$. If $\sigma$ is known approximately, perhaps from earlier
samples of the population, $n$ can be chosen so as to obtain a confidence interval close
to some desired length. Such analysis is usually an important aspect of planning the
design of a sample survey.


The interval for the total number of owners planning to sell in Example D might be
considered too wide for practical purposes; reducing its width would require a larger
sample size. Suppose that an interval with a half-width of 200 is desired. Neglecting
the finite population correction, the half-width is

$$1.96 \, s_T = 1.96 \, N \sqrt{\frac{\hat{p}(1 - \hat{p})}{n - 1}} = \frac{5095}{\sqrt{n - 1}}$$

Setting the last expression equal to 200 and solving for $n$ yields $n = 650$ as the
necessary sample size. ■
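The sample-size calculation amounts to inverting the half-width formula for $n$. The Python sketch below is an illustration only; the function name `sample_size_for_half_width` is an assumption, and, like the example, it neglects the finite population correction.

```python
def sample_size_for_half_width(p_hat, N, half_width, z=1.96):
    """Solve z * N * sqrt(p(1 - p) / (n - 1)) = half_width for n.

    Returns the real-valued solution; the finite population correction
    is neglected, as in the example.
    """
    return 1 + (z * N) ** 2 * p_hat * (1 - p_hat) / half_width ** 2

n_needed = sample_size_for_half_width(p_hat=0.12, N=8000, half_width=200)
print(round(n_needed, 1))  # close to 650
```

Because the solution is not an integer, a survey planner who wants the half-width to be at most 200 would round up rather than to the nearest integer.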





Let us summarize: The fundamental result of this section is that the sampling 
distribution of the sample mean is approximately Gaussian. This approximation can be 
used to quantify the error committed in estimating the population mean by the sample 
mean, thus giving us a good understanding of the accuracy of estimates produced 
by a simple random sample. We next introduced the idea of a confidence interval, 
a random interval that contains a population parameter with a specified probability 
and thus provides an assessment of the accuracy of the corresponding estimate of that 
parameter. We have seen in our examples that the width of the confidence interval is a
multiple of the estimated standard deviation of the estimate; for example, a confidence
interval for $\mu$ is $\bar{X} \pm k s_{\bar{X}}$, where the constant $k$ depends on the coverage probability
of the interval.


7.4 Estimation of a Ratio


The foundations of the theory of survey sampling have been laid in the preceding sec- 
tions on simple random sampling. This and the next section build on that foundation, 
developing some advanced topics in survey sampling. 

In this section, we consider the estimation of a ratio. Suppose that for each member
of a population, two values, $x$ and $y$, may be measured. The ratio of interest is

$$r = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i} = \frac{\mu_y}{\mu_x}$$

Ratios arise frequently in sample surveys; for example, if households are sampled,
the following ratios might be calculated:

* If $y$ is the number of unemployed males aged 20-30 in a household and $x$ is the
number of males aged 20-30 in a household, then $r$ is the proportion of unemployed
males aged 20-30.

* If $y$ is weekly food expenditure and $x$ is number of inhabitants, then $r$ is weekly
food cost per inhabitant.

* If $y$ is the number of motor vehicles and $x$ is the number of inhabitants of driving
age, then $r$ is the number of motor vehicles per inhabitant of driving age.


In a survey of farms, $y$ might be the acres of wheat planted and $x$ the total acreage.
In an inventory audit, $y$ might be the audited value of an item and $x$ the book value.
In this section, we first consider directly the problem of estimating a ratio. Later,
we will use the estimation of a ratio as a technique for estimating $\mu_y$. We will produce
a new estimate, the ratio estimate, which we will compare to the ordinary estimate, $\bar{Y}$.
Before continuing, we note the elementary but sometimes overlooked fact that

$$r \ne \frac{1}{N} \sum_{i=1}^{N} \frac{y_i}{x_i}$$

Suppose that a sample is drawn consisting of the pairs $(X_i, Y_i)$; the natural
estimate of $r$ is $R = \bar{Y}/\bar{X}$. We wish to derive expressions for $E(R)$ and $\mathrm{Var}(R)$, but
since $R$ is a nonlinear function of the random variables $\bar{X}$ and $\bar{Y}$, we cannot do this
in closed form. We will therefore employ the approximate methods of Section 4.6.

In order to calculate the approximate variance of $R$, we need to know $\mathrm{Var}(\bar{X})$,
$\mathrm{Var}(\bar{Y})$, and $\mathrm{Cov}(\bar{X}, \bar{Y})$. The first two quantities we know from Theorem B of Section
7.3.1. For the last quantity, we define the population covariance of $x$ and $y$ to be

$$\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$$

It can then be shown, in a manner entirely analogous to the proof of Theorem B in
Section 7.3.1, that

$$\mathrm{Cov}(\bar{X}, \bar{Y}) = \frac{\sigma_{xy}}{n} \left( 1 - \frac{n - 1}{N - 1} \right)$$

From Example C in Section 4.6, we have the following theorem. 


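THEOREM A

With simple random sampling, the approximate variance of $R = \bar{Y}/\bar{X}$ is

$$\begin{aligned}
\mathrm{Var}(R) &\approx \frac{1}{\mu_x^2} \left( r^2 \sigma_{\bar{X}}^2 + \sigma_{\bar{Y}}^2 - 2r \sigma_{\bar{X}\bar{Y}} \right) \\
&= \frac{1}{n} \left( 1 - \frac{n - 1}{N - 1} \right) \frac{1}{\mu_x^2} \left( r^2 \sigma_x^2 + \sigma_y^2 - 2r \sigma_{xy} \right)
\end{aligned}$$

■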


The population correlation coefficient is defined as

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

and is used as a measure of the strength of the linear relationship between the $x$ and
$y$ values in the population. It can be shown that $-1 \le \rho \le 1$; values of $\rho$ near 1
indicate a strong positive relationship between $x$ and $y$, and values near $-1$ indicate a
strong negative relationship. (See Figure 4.7 for some illustrations of correlation.)
The equation in Theorem A can be expressed in terms of the population correlation
coefficient as follows:

$$\mathrm{Var}(R) \approx \frac{1}{n} \left( 1 - \frac{n - 1}{N - 1} \right) \frac{1}{\mu_x^2} \left( r^2 \sigma_x^2 + \sigma_y^2 - 2r \rho \sigma_x \sigma_y \right)$$

From this expression, we see that strong correlation of the same sign as $r$ decreases the
variance. We also note that the variance is affected by the size of $\mu_x$: if $\mu_x$ is small,
the variance is large, essentially because small values of $\bar{X}$ in the ratio $R = \bar{Y}/\bar{X}$
cause $R$ to fluctuate wildly.

We now consider the approximate expectation of R. From Example C in Section 
4.6 and the preceding calculations, we have the following theorem. 





THEOREM B

With simple random sampling, the expectation of $R$ is given approximately by

$$E(R) \approx r + \frac{1}{n} \left( 1 - \frac{n - 1}{N - 1} \right) \frac{1}{\mu_x^2} \left( r \sigma_x^2 - \rho \sigma_x \sigma_y \right)$$

■

From the equation in Theorem B, we see that strong correlation of the same
sign as $r$ decreases the bias and that the bias is large if $\mu_x$ is small. Furthermore,
note that the bias is of the order $1/n$, so its contribution to the mean squared error is
of the order $1/n^2$. In comparison, the contribution of the variance is of the order $1/n$.
Therefore, for large samples, the bias is negligible compared to the standard error of
the estimate.

For large samples, truncating the Taylor series after the linear term provides a
good approximation, since the deviations $\bar{X} - \mu_x$ and $\bar{Y} - \mu_y$ are likely to be small.
To this order of approximation, $R$ is expressed as a linear combination of $\bar{X}$ and $\bar{Y}$,
and an argument based on the central limit theorem can be used to show that $R$ is
approximately normally distributed. Approximate confidence intervals can thus be
formed for $r$ by using the normal distribution.

In order to estimate the standard error of $R$, we substitute $R$ for $r$ in the formula
of Theorem A. The $x$ and $y$ population variances are estimated by $s_x^2$ and $s_y^2$. The
population covariance is estimated by

$$\begin{aligned}
s_{xy} &= \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \\
&= \frac{1}{n - 1} \left( \sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y} \right)
\end{aligned}$$

(as can be seen by expanding the product), and the population correlation is estimated
by

$$\hat{\rho} = \frac{s_{xy}}{s_x s_y}$$

The estimated variance of $R$ is thus

$$s_R^2 = \frac{1}{n} \left( 1 - \frac{n - 1}{N - 1} \right) \frac{1}{\bar{X}^2} \left( R^2 s_x^2 + s_y^2 - 2R s_{xy} \right)$$

An approximate $100(1 - \alpha)\%$ confidence interval for $r$ is $R \pm z(\alpha/2) s_R$.








Suppose that 100 people who recently bought houses are surveyed, and the monthly
mortgage payment and gross income of each buyer are determined. Let $y$ denote the
mortgage payment and $x$ the gross income. Suppose that

    $\bar{X} = \$3100$      $\bar{Y} = \$868$
    $s_y = \$250$           $s_x = \$1200$
    $\hat{\rho} = .85$      $R = .28$

Neglecting the finite population correction, the estimated standard error of $R$ is

$$\begin{aligned}
s_R &= \frac{1}{10} \left( \frac{1}{3100} \right) \sqrt{.28^2 \times 1200^2 + 250^2 - 2 \times .28 \times .85 \times 250 \times 1200} \\
&= .006
\end{aligned}$$

An approximate 95% confidence interval for $r$ is $.28 \pm (1.96) \times (.006)$, or $.28 \pm .012$.
Note that the high correlation between $x$ and $y$ causes the standard error of $R$ to be
small. We can use the observed values for the variances, covariances, and means to
gauge the order of magnitude of the bias by substituting them in place of the population
parameters in the formula of Theorem B. Doing so, and again neglecting the finite
population correction, gives the value .00015 for the bias, which is negligible relative
to $s_R$. Note that the large value of $\bar{X}$ and the large positive correlation coefficient
cause the bias to be small. ■
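The two plug-in calculations in this example, the estimated standard error of $R$ from Theorem A and the estimated bias from Theorem B, can be sketched in Python as follows. The function names `ratio_se` and `ratio_bias` are assumptions made for illustration, and the finite population correction is neglected as in the example.

```python
import math

def ratio_se(R, xbar, s_x, s_y, rho, n):
    """Estimated standard error of R = Ybar / Xbar (no finite population correction)."""
    s_xy = rho * s_x * s_y  # estimated covariance from the correlation
    variance = (R ** 2 * s_x ** 2 + s_y ** 2 - 2 * R * s_xy) / (n * xbar ** 2)
    return math.sqrt(variance)

def ratio_bias(R, xbar, s_x, s_y, rho, n):
    """Plug-in estimate of the approximate bias E(R) - r (no finite population correction)."""
    return (R * s_x ** 2 - rho * s_x * s_y) / (n * xbar ** 2)

# Summary statistics from the mortgage example
stats = dict(R=0.28, xbar=3100, s_x=1200, s_y=250, rho=0.85, n=100)

s_R = ratio_se(**stats)     # about .006
bias = ratio_bias(**stats)  # about .00015
print(round(s_R, 3), round(bias, 5))
```

Setting `rho=0` in the same call shows how much larger the standard error would be without the correlation between income and mortgage payment.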


Ratios may also be used as tools for estimating population means and totals.
To illustrate the concept, we return to the example of hospital discharges. For this
population, the number of beds in each hospital is also known; let us denote the number
of beds in the $i$th hospital by $x_i$ and the number of discharges by $y_i$. Suppose that
all the $x_i$ are known, perhaps from an earlier enumeration, before a sample has been
taken to estimate the number of discharges, and that we would like to take advantage
of this information. One way to do this is to form a ratio estimate of $\mu_y$:

$$\bar{Y}_R = \frac{\mu_x}{\bar{X}} \bar{Y} = \mu_x R$$

where $\bar{X}$ is the average number of beds and $\bar{Y}$ is the average number of discharges in
the sample. The idea is fairly simple: We expect $x_i$ and $y_i$ to be closely related in the
population, since a hospital with a large number of beds should tend to have a large
number of discharges. This is borne out by Figure 7.5, a scatterplot of the number
of discharges versus the number of beds. If $\bar{X} < \mu_x$, the sample underestimates the
number of beds and probably the number of discharges as well; multiplying $\bar{Y}$ by
$\mu_x/\bar{X}$ increases $\bar{Y}$ to $\bar{Y}_R$.

FIGURE 7.5 Scatterplot of the number of discharges versus the number of beds for
the 393 hospitals.

FIGURE 7.6 (a) A histogram of the means of 500 simple random samples of size 64
from the population of discharges; (b) a histogram of the values of 500 ratio estimates
of the mean number of discharges from samples of size 64.


To see how this ratio estimate works in practice, it was simulated from 500 samples
of size 64 from the population of hospitals. The histogram of the results is shown
in Figure 7.6 along with the histogram of the means of 500 simple random samples
of size 64. The comparison shows dramatically how effective the ratio estimate is at
reducing variability.

Two more examples will illustrate the scope of the ratio estimation method.

Suppose that we want to estimate the total number of unemployed males aged 20-30
from a sample of households and that we know $\tau_x$, the total number of males aged
20-30, from census data. The ratio estimate is

$$T_R = \tau_x \frac{\bar{Y}}{\bar{X}}$$

where $\bar{Y}$ is the average number of unemployed males aged 20-30 per household in
the sample, and $\bar{X}$ is the sample average number of males aged 20-30 per household. ■

A sample of items in an inventory is taken to estimate the total value of the inventory.
Let $Y_i$ be the audited value of the $i$th sample item, and let $X_i$ be its book value. We
assume that $\tau_x$, the total book value of the inventory, is known, and we estimate the
total audited value by

$$T_R = \tau_x \frac{\bar{Y}}{\bar{X}}$$

■


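We will now analyze the observed success of the ratio estimate. Since $\bar{Y}_R = \mu_x R$,
$\mathrm{Var}(\bar{Y}_R) = \mu_x^2 \mathrm{Var}(R)$. From Theorem A, we thus have the following.

COROLLARY A

The approximate variance of the ratio estimate of $\mu_y$ is

$$\mathrm{Var}(\bar{Y}_R) \approx \frac{1}{n} \left( 1 - \frac{n - 1}{N - 1} \right) \left( r^2 \sigma_x^2 + \sigma_y^2 - 2r \rho \sigma_x \sigma_y \right)$$

■

Similarly, from Theorem B, we have another corollary.

COROLLARY B

The approximate bias of the ratio estimate of $\mu_y$ is

$$E(\bar{Y}_R) - \mu_y \approx \frac{1}{n} \left( 1 - \frac{n - 1}{N - 1} \right) \frac{1}{\mu_x} \left( r \sigma_x^2 - \rho \sigma_x \sigma_y \right)$$

■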


When will the ratio estimate $\bar{Y}_R$ be better than the ordinary estimate $\bar{Y}$? In the following,
the finite population correction is neglected for simplicity. Since the variance
of the ordinary estimate $\bar{Y}$ is

$$\mathrm{Var}(\bar{Y}) = \frac{\sigma_y^2}{n}$$

the ratio estimate has a smaller variance if

$$r^2 \sigma_x^2 - 2r \rho \sigma_x \sigma_y < 0$$

or (provided $r > 0$, for example)

$$2\rho \sigma_y > r \sigma_x$$

Letting $C_x = \sigma_x/\mu_x$ and $C_y = \sigma_y/\mu_y$, this last inequality is equivalent to

$$\rho > \frac{1}{2} \left( \frac{C_x}{C_y} \right)$$

$C_x$ and $C_y$ are called coefficients of variation and give the standard deviation as a
proportion of the mean. (Coefficients of variation are often more meaningful than
standard deviations. For example, a standard deviation of 10 means one thing if the
true value of the quantity being measured is 100 and something entirely different if
the true value is 10,000.)

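In order to assess the accuracy of $\bar{Y}_R$, $\mathrm{Var}(\bar{Y}_R)$ can be estimated from the sample.

COROLLARY C

The variance of $\bar{Y}_R$ can be estimated by

$$s_{\bar{Y}_R}^2 = \frac{1}{n} \left( 1 - \frac{n - 1}{N - 1} \right) \left( R^2 s_x^2 + s_y^2 - 2R s_{xy} \right)$$

and an approximate $100(1 - \alpha)\%$ confidence interval for $\mu_y$ is $\bar{Y}_R \pm
z(\alpha/2) s_{\bar{Y}_R}$. ■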


For the population of 393 hospitals, we have 


Hx = 274.8 0, = 213.2 
Uy = 814.6 oy = 589.7 
r = 2.96 p =.91 


Thus, 
= 1 
Var(Y R) ~ —(2.96) x 213.2? + 589.7? — 2 x 2.96 x .91 x 213.2 x 589.7) 
n 
| 68,697.4 


n 


and 
" 262.1 


ie /n 


Including the finite population correction, the linearized approximation predicts that,
with $n = 64$,

$$\sigma_{\bar{Y}_R} = \frac{1}{8}(262.1)\sqrt{1 - \frac{63}{392}} = 30.0$$

The actual standard deviation of the 500 sample values displayed in Figure 7.6 is
29.9, which is remarkably close. The mean of the 500 values is 816.2, compared to
the population mean of 814.6; the slight apparent bias is consistent with Corollary B.

In contrast, the standard deviation of $\bar{Y}$ from a simple random sample of size
$n = 64$ is

$$\sigma_{\bar{Y}} = \frac{\sigma_y}{\sqrt{n}}\sqrt{1 - \frac{n-1}{N-1}} = \frac{589.7}{8}\sqrt{1 - \frac{63}{392}} = 67.5$$

The comparison of $\sigma_{\bar{Y}}$ to $\sigma_{\bar{Y}_R}$ is consistent with the substantial reduction in variability
accomplished by using a ratio estimate of $\mu_y$ shown in Figure 7.6.
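The two standard deviations can be recomputed from the population values quoted above. The sketch below is only a check of the arithmetic; the function names are hypothetical and the constants are the rounded values given in the text.

```python
import math

# Population values quoted in the text (393 hospitals).
sigma_x, sigma_y = 213.2, 589.7
r, rho, N = 2.96, 0.91, 393


def fpc(n, N):
    """Finite population correction factor 1 - (n - 1)/(N - 1)."""
    return 1 - (n - 1) / (N - 1)


def sd_ratio_estimate(n):
    """Linearized standard deviation of the ratio estimate of mu_y."""
    v = r**2 * sigma_x**2 + sigma_y**2 - 2 * r * rho * sigma_x * sigma_y
    return math.sqrt(v / n * fpc(n, N))


def sd_sample_mean(n):
    """Standard deviation of the ordinary sample mean of y."""
    return math.sqrt(sigma_y**2 / n * fpc(n, N))


print(round(sd_ratio_estimate(64), 1))  # about 30.0, as quoted above
print(round(sd_sample_mean(64), 1))
```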

The following is another way of interpreting this comparison. If a simple random
sample of size $n_1$ is taken, the variance of the estimate is $\mathrm{Var}(\bar{Y}) = 589.7^2/n_1$. A
ratio estimate from a sample of size $n_2$ will have the same variance if

$$\frac{262.1^2}{n_2} = \frac{589.7^2}{n_1}$$

or

$$n_2 = n_1\left(\frac{262.1}{589.7}\right)^2 = .1975\,n_1$$

Thus, in this case, we can obtain the same precision from a ratio estimate using a
sample about 80% smaller than the simple random sample. Note that this comparison
neglects the bias of the ratio estimate, which is justifiable in this case because the bias
is quite small. Here is a case in which a biased estimate performs substantially better
than an unbiased estimate, the bias being quite small and the reduction in variance
being quite large. ■





Stratified Random Sampling 


Introduction and Notation 


In stratified random sampling, the population is partitioned into subpopulations, or 
strata, which are then independently sampled. The results from the strata are then 
combined to estimate population parameters, such as the mean. 

Following are some examples that suggest the range of situations in which strat- 
ification is natural: 


In auditing financial transactions, the transactions may be grouped into strata on 
the basis of their nominal values. For example, high-value, medium-value, and 
low-value strata might be formed. 

In samples of human populations, geographical areas often form natural strata. 

In a study of records of shipments of household goods by motor carriers, the carriers 
were grouped into three strata: large carriers, medium carriers, and small carriers.

Stratified samples are used for a variety of reasons. We are often interested in
obtaining information about each of a number of natural subpopulations in addition 
to information about the population as a whole. The subpopulations might be defined 
by geographical areas or age groups. In an industrial application in which the popula- 
tion consists of items produced by a manufacturing process, relevant subpopulations 
might consist of items produced during different shifts or from different lots of raw 
material. The use of a stratified random sample guarantees a prescribed number of 
observations from each subpopulation, whereas the use of a simple random sample 
can result in underrepresentation of some subpopulations. A second reason for using 
stratification is that, as will be shown below, the stratified sample mean can be con- 
siderably more precise than the mean of a simple random sample, especially if the 
population members within each stratum are relatively homogeneous and if there is 
considerable variation between strata. 

In the next section, properties of the stratified sample mean are derived. Since 
a simple random sample is taken within each stratum, the results will follow easily 
from the derivations of earlier sections. The section after that takes up the problem 
of how to allocate the total number of observations, n, among the various strata. 
Comparisons will be made of the efficiencies of different allocation schemes and 
also of the precisions of these allocation schemes relative to that of a simple random 
sample of the same total size. 


Properties of Stratified Estimates 


Suppose there are $L$ strata in all. Let the number of population elements in stratum
1 be denoted by $N_1$, the number in stratum 2 be $N_2$, etc. The total population size
is $N = N_1 + N_2 + \cdots + N_L$. The population mean and variance of the $l$th stratum
are denoted by $\mu_l$ and $\sigma_l^2$. The overall population mean can be expressed in terms of
the $\mu_l$ as follows. Let $x_{il}$ denote the $i$th population value in the $l$th stratum and let
$W_l = N_l/N$ denote the fraction of the population in the $l$th stratum. Then

$$\mu = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N_l} x_{il} = \frac{1}{N}\sum_{l=1}^{L} N_l\mu_l = \sum_{l=1}^{L} W_l\mu_l$$


Within each stratum, a simple random sample of size $n_l$ is taken. The sample
mean in stratum $l$ is denoted by

$$\bar{X}_l = \frac{1}{n_l}\sum_{i=1}^{n_l} X_{il}$$

Here $X_{il}$ denotes the $i$th sample value in the $l$th stratum. Note that $\bar{X}_l$ is the mean of
a simple random sample from the population consisting of the $l$th stratum, so from
Theorem A of Section 7.3.1, $E(\bar{X}_l) = \mu_l$. By analogy with the preceding relationship
between the overall population mean and the population means of the various strata,
the obvious estimate of $\mu$ is

$$\bar{X}_s = \sum_{l=1}^{L} W_l\bar{X}_l$$

THEOREM A 


The stratified estimate, $\bar{X}_s$, of the population mean is unbiased.


Proof

$$E(\bar{X}_s) = \sum_{l=1}^{L} W_l E(\bar{X}_l) = \sum_{l=1}^{L} W_l\mu_l = \mu$$

■


Since we assume that the samples from different strata are independent of one 
another and that within each stratum a simple random sample is taken, the variance 
of $\bar{X}_s$ can be easily calculated.


THEOREM B 


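The variance of the stratified sample mean is given by

$$\mathrm{Var}(\bar{X}_s) = \sum_{l=1}^{L} W_l^2\left(\frac{1}{n_l}\right)\left(1 - \frac{n_l - 1}{N_l - 1}\right)\sigma_l^2$$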





Proof 


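Since the $\bar{X}_l$ are independent,

$$\mathrm{Var}(\bar{X}_s) = \sum_{l=1}^{L} W_l^2\,\mathrm{Var}(\bar{X}_l)$$

From Theorem B of Section 7.3.1, we have

$$\mathrm{Var}(\bar{X}_l) = \frac{1}{n_l}\left(1 - \frac{n_l - 1}{N_l - 1}\right)\sigma_l^2$$

Therefore, the desired result follows. ■

If the sampling fractions within all strata are small,

$$\mathrm{Var}(\bar{X}_s) \approx \sum_{l=1}^{L} \frac{W_l^2\sigma_l^2}{n_l}$$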


EXAMPLE A   We again consider the population of hospitals. As we did in the discussion of ratio
estimates, we assume that the number of beds in each hospital is known but that the
number of discharges is not. We will try to make use of this knowledge by stratifying
the hospitals according to the number of beds. Let stratum A consist of the 98 smallest
hospitals, stratum B of the 98 next larger, stratum C of the 98 next larger, and stratum
D of the 99 largest. The following table shows the results of this stratification of
hospitals by size:

Stratum    $N_l$    $W_l$    $\mu_l$    $\sigma_l$
A           98      .249     182.9      103.4
B           98      .249     526.5      204.8
C           98      .249     956.3      243.5
D           99      .251    1591.2      419.2





Suppose that we use a sample of total size $n$ and let

$$n_1 = n_2 = n_3 = n_4 = \frac{n}{4}$$

so that we have equal sample sizes in each stratum. Then, from Theorem B, neglecting
the finite population corrections and using the numerical values in the preceding table,
we have

$$\mathrm{Var}(\bar{X}_s) = \sum_{l=1}^{4} \frac{W_l^2\sigma_l^2}{n_l} = \frac{4}{n}\sum_{l=1}^{4} W_l^2\sigma_l^2 = \frac{72{,}042.6}{n}$$

and

$$\sigma_{\bar{X}_s} = \frac{268.4}{\sqrt{n}}$$

The standard deviation of the mean of a simple random sample is

$$\sigma_{\bar{X}} = \frac{589.7}{\sqrt{n}}$$

Comparing the two standard deviations, we see that a tremendous gain in precision
has resulted from the stratification. The ratio of the variances is .20; thus a stratified
estimate based on a total sample size of $n/5$ is as precise as a simple random sample
of size $n$. The reduction in variance due to stratification is comparable to that achieved
by using a ratio estimate (Example D in Section 7.4). In later parts of this section, we
will look more analytically at why the stratification done here produced such dramatic
improvement. ■
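The figure 72,042.6 can be reproduced from the table. The sketch below assumes the rounded table entries and neglects the finite population correction; `stratified_variance` is a hypothetical helper name introduced for illustration.

```python
import math

# Stratum weights and standard deviations from the table (strata A-D).
W = [0.249, 0.249, 0.249, 0.251]
sigma = [103.4, 204.8, 243.5, 419.2]


def stratified_variance(W, sigma, n_l):
    """Variance of the stratified mean: sum of W_l^2 sigma_l^2 / n_l (no fpc)."""
    return sum(w**2 * s**2 / n for w, s, n in zip(W, sigma, n_l))


n = 64  # any total sample size divisible by 4 works for this check
var_equal = stratified_variance(W, sigma, [n / 4] * 4)

print(round(n * var_equal, 1))             # numerator of the variance, about 72,042.6
print(round(math.sqrt(n * var_equal), 1))  # numerator of the standard deviation, about 268.4
```

Passing a different list of stratum sample sizes to the same function gives the variance under any other allocation.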


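Let us next consider the stratified estimate of the population total, $T_s = N\bar{X}_s$.
From Theorem B, we have the following corollary.


COROLLARY A

The expectation and variance of the stratified estimate of the population total are

$$E(T_s) = \tau$$

and

$$\mathrm{Var}(T_s) = N^2\,\mathrm{Var}(\bar{X}_s) = N^2\sum_{l=1}^{L} W_l^2\left(\frac{1}{n_l}\right)\left(1 - \frac{n_l - 1}{N_l - 1}\right)\sigma_l^2 = \sum_{l=1}^{L} N_l^2\left(\frac{1}{n_l}\right)\left(1 - \frac{n_l - 1}{N_l - 1}\right)\sigma_l^2$$

■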





In order to estimate the standard errors of $\bar{X}_s$ and $T_s$, the variances of the individual
strata must be separately estimated and substituted into the preceding formulae. The
estimate of $\sigma_l^2$ is given by

$$s_l^2 = \frac{1}{n_l - 1}\sum_{i=1}^{n_l}\left(X_{il} - \bar{X}_l\right)^2$$

$\mathrm{Var}(\bar{X}_s)$ is estimated by

$$s_{\bar{X}_s}^2 = \sum_{l=1}^{L} W_l^2\left(\frac{1}{n_l}\right)\left(1 - \frac{n_l}{N_l}\right)s_l^2$$

The next example illustrates how this variance estimate can be used to find
approximate confidence intervals for $\mu$ based on $\bar{X}_s$.


EXAMPLE B   A sample of size 10 was drawn from each of the four strata of hospitals described in
Example A, yielding the following:

$\bar{X}_1 = 240.6$     $s_1^2 = 6827.6$
$\bar{X}_2 = 507.4$     $s_2^2 = 23{,}790.7$
$\bar{X}_3 = 865.1$     $s_3^2 = 42{,}573.0$
$\bar{X}_4 = 1716.5$    $s_4^2 = 152{,}099.6$

Therefore, $\bar{X}_s = 832.5$. The variance of the stratified sample mean is estimated by
substituting $n_l = 10$ and these values of $s_l^2$ into the preceding expression for $s_{\bar{X}_s}^2$. Thus,

$$s_{\bar{X}_s} = 35.8$$

An approximate 95% confidence interval for the population mean number of discharges
is $\bar{X}_s \pm 1.96\, s_{\bar{X}_s}$, or (762.4, 902.7).

The total number of discharges is estimated by $T_s = 393\bar{X}_s = 327{,}172$. The
standard error of $T_s$ is estimated by $s_{T_s} = 393\, s_{\bar{X}_s} = 14{,}069$. An approximate
95% confidence interval for the population total is $T_s \pm 1.96\, s_{T_s}$, or (299,596, 354,748). ■
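The calculation in this example can be sketched in a few lines. The code below is an illustration under stated assumptions, not the computation used in the text: it takes the rounded stratum weights from Example A, applies the correction factor $1 - n_l/N_l$ within each stratum, and uses the hypothetical function name `stratified_estimate`. With these inputs the mean reproduces 832.5, while the standard error comes out near 35.6, slightly below the quoted 35.8; the small gap presumably reflects rounding or the exact form of the finite population correction.

```python
import math

W = [0.249, 0.249, 0.249, 0.251]             # stratum weights (rounded, Example A)
N_l = [98, 98, 98, 99]                       # stratum sizes
n_l = [10, 10, 10, 10]                       # sample size in each stratum
xbar_l = [240.6, 507.4, 865.1, 1716.5]       # stratum sample means
s2_l = [6827.6, 23790.7, 42573.0, 152099.6]  # stratum sample variances


def stratified_estimate(W, N_l, n_l, xbar_l, s2_l, z=1.96):
    """Stratified mean, its estimated standard error, and an approximate CI."""
    mean = sum(w * xb for w, xb in zip(W, xbar_l))
    var = sum(w**2 * (1 / n) * (1 - n / N) * s2
              for w, n, N, s2 in zip(W, n_l, N_l, s2_l))
    se = math.sqrt(var)
    return mean, se, (mean - z * se, mean + z * se)


mean, se, ci = stratified_estimate(W, N_l, n_l, xbar_l, s2_l)
print(round(mean, 1), round(se, 1), tuple(round(c, 1) for c in ci))
```

Multiplying the mean and the standard error by $N = 393$ gives the corresponding estimate and standard error for the population total.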


Methods of Allocation 


In Section 7.5.2, it was shown that, neglecting the finite population correction,

$$\mathrm{Var}(\bar{X}_s) = \sum_{l=1}^{L} \frac{W_l^2\sigma_l^2}{n_l}$$

If the resources of a survey allow only a total of $n$ units to be sampled, the question
arises of how to choose $n_1, \ldots, n_L$ to minimize $\mathrm{Var}(\bar{X}_s)$ subject to the constraint
$n_1 + \cdots + n_L = n$.

For the sake of simplicity, the calculations in this section ignore the finite popu- 
lation correction within each stratum. The analysis may be extended to include these 
corrections, but at the cost of some additional algebra. More complete results are 
contained in Cochran (1977). 


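THEOREM A

The sample sizes $n_1, \ldots, n_L$ that minimize $\mathrm{Var}(\bar{X}_s)$ subject to the constraint
$n_1 + \cdots + n_L = n$ are given by

$$n_l = n\,\frac{W_l\sigma_l}{\sum_{k=1}^{L} W_k\sigma_k}$$

Proof

We introduce a Lagrange multiplier, and we must then minimize

$$L(n_1, \ldots, n_L, \lambda) = \sum_{l=1}^{L} \frac{W_l^2\sigma_l^2}{n_l} + \lambda\left(\sum_{l=1}^{L} n_l - n\right)$$

For $l = 1, \ldots, L$, we have

$$\frac{\partial L}{\partial n_l} = -\frac{W_l^2\sigma_l^2}{n_l^2} + \lambda$$

Setting these partial derivatives equal to zero, we have the system of equations

$$n_l = \frac{W_l\sigma_l}{\sqrt{\lambda}}$$

for $l = 1, \ldots, L$. To determine $\lambda$, we first sum these equations over $l$:

$$n = \frac{1}{\sqrt{\lambda}}\sum_{l=1}^{L} W_l\sigma_l$$

Thus,

$$\frac{1}{\sqrt{\lambda}} = \frac{n}{\sum_{l=1}^{L} W_l\sigma_l}$$

and

$$n_l = n\,\frac{W_l\sigma_l}{\sum_{k=1}^{L} W_k\sigma_k}$$

which proves the theorem. ■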


This theorem shows that those strata for which $W_l\sigma_l$ is large should be sampled
heavily. This makes sense intuitively. If $W_l$ is large, the stratum contains a large
fraction of the population; if $\sigma_l$ is large, the population values in the stratum are
quite variable, and in order to obtain a good determination of the stratum's mean, a
relatively large sample size must be used. This optimal allocation scheme is called
Neyman allocation.

Substituting the optimal values of $n_l$ as given in Theorem A into the equation for
$\mathrm{Var}(\bar{X}_s)$ given in Theorem B in Section 7.5.2 gives us the following corollary.


COROLLARY A

Denoting by $\bar{X}_{so}$ the stratified estimate using the optimal allocations as given in
Theorem A and neglecting the finite population correction,

$$\mathrm{Var}(\bar{X}_{so}) = \frac{\left(\sum_{l=1}^{L} W_l\sigma_l\right)^2}{n}$$

■

EXAMPLE A   For the population of hospitals, the weights for optimal allocation, $W_l\sigma_l/\sum W_l\sigma_l$,
are, from the table of Example A of Section 7.5.2,

Stratum     A       B       C       D
Weight     .106    .210    .250    .434

Note that, because of its larger standard deviation, stratum D is sampled more than
four times as heavily as stratum A. ■
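The weights in this example follow directly from Theorem A. The sketch below uses the rounded table values; the function names `neyman_weights` and `neyman_allocation` are hypothetical. The allocations it returns are generally not integers, so in practice they would have to be rounded, a step this sketch leaves out.

```python
def neyman_weights(W, sigma):
    """Fractions W_l sigma_l / sum_k W_k sigma_k of the total sample per stratum."""
    total = sum(w * s for w, s in zip(W, sigma))
    return [w * s / total for w, s in zip(W, sigma)]


def neyman_allocation(W, sigma, n):
    """Optimal (Neyman) stratum sample sizes for a total sample of size n."""
    return [n * f for f in neyman_weights(W, sigma)]


# Hospital strata A-D: weights and standard deviations from the table.
W = [0.249, 0.249, 0.249, 0.251]
sigma = [103.4, 204.8, 243.5, 419.2]

print([round(f, 3) for f in neyman_weights(W, sigma)])  # [0.106, 0.21, 0.25, 0.434]
print([round(m, 1) for m in neyman_allocation(W, sigma, 100)])
```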


The optimal allocations depend on the individual variances of the strata, which 
generally will not be known. Furthermore, if a survey measures several attributes 
for each population member, it is usually impossible to find an allocation that is 
simultaneously optimal for all of those variables. A simple and popular alternative 
method of allocation is to use the same sampling fraction in each stratum,

$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \cdots = \frac{n_L}{N_L}$$

which holds if

$$n_l = n\frac{N_l}{N} = nW_l$$

for $l = 1, \ldots, L$. This method is called proportional allocation. The estimate of the
population mean based on proportional allocation is

$$\bar{X}_{sp} = \sum_{l=1}^{L} W_l\bar{X}_l = \sum_{l=1}^{L} W_l\frac{1}{n_l}\sum_{i=1}^{n_l} X_{il} = \frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n_l} X_{il}$$

since $W_l/n_l = 1/n$. This estimate is simply the unweighted mean of the sample
values.
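A small numerical illustration makes the last remark concrete. The stratum sizes and sample values below are made up for the purpose of the sketch; with proportional allocation, the weighted stratified estimate and the plain average of all sampled values coincide.

```python
import math

# Hypothetical stratum sizes and total sample size.
N_l = [200, 300, 500]
n = 10
N = sum(N_l)
W = [Nl / N for Nl in N_l]

# Proportional allocation: n_l = n * W_l (here exactly 2, 3, and 5).
n_l = [round(n * w) for w in W]

# Made-up sample values, one list per stratum, with len(samples[l]) == n_l[l].
samples = [[4.0, 6.0], [10.0, 11.0, 15.0], [20.0, 22.0, 21.0, 25.0, 27.0]]

weighted = sum(w * sum(s) / len(s) for w, s in zip(W, samples))
unweighted = sum(sum(s) for s in samples) / n

print(n_l, math.isclose(weighted, unweighted))
```

When $nW_l$ is not an integer the allocation can only be approximately proportional, and the two estimates then differ slightly.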
THEOREM B

With stratified sampling based on proportional allocation, ignoring the finite
population correction,

$$\mathrm{Var}(\bar{X}_{sp}) = \frac{1}{n}\sum_{l=1}^{L} W_l\sigma_l^2$$

Proof

From Theorem B of Section 7.5.2, we have

$$\mathrm{Var}(\bar{X}_{sp}) = \sum_{l=1}^{L} W_l^2\,\mathrm{Var}(\bar{X}_l) = \sum_{l=1}^{L} \frac{W_l^2\sigma_l^2}{n_l}$$

Using $n_l = nW_l$, the result follows. ■


We now compare $\mathrm{Var}(\bar{X}_{sp})$ and $\mathrm{Var}(\bar{X}_{so})$ in order to discover the circumstances
under which optimal allocation is substantially better than proportional allocation.


THEOREM C

With stratified random sampling, the difference between the variance of the
estimate of the population mean based on proportional allocation and the variance
of that estimate based on optimal allocation is, ignoring the finite population
correction,

$$\mathrm{Var}(\bar{X}_{sp}) - \mathrm{Var}(\bar{X}_{so}) = \frac{1}{n}\sum_{l=1}^{L} W_l\left(\sigma_l - \bar{\sigma}\right)^2$$

where

$$\bar{\sigma} = \sum_{l=1}^{L} W_l\sigma_l$$

Proof

$$\mathrm{Var}(\bar{X}_{sp}) - \mathrm{Var}(\bar{X}_{so}) = \frac{1}{n}\left[\sum_{l=1}^{L} W_l\sigma_l^2 - \left(\sum_{l=1}^{L} W_l\sigma_l\right)^2\right]$$

The term within the large brackets equals $\sum_{l=1}^{L} W_l(\sigma_l - \bar{\sigma})^2$, which may be
verified by expanding the square and collecting terms. ■


According to Theorem C, if the variances of the strata are all the same, proportional
allocation yields the same results as optimal allocation. The more variable these
variances are, the better it is to use optimal allocation.

EXAMPLE B   Let us calculate how much better optimal allocation is than proportional allocation
for the population of hospitals. From Theorem C and Corollary A, we have

$$\mathrm{Var}(\bar{X}_{sp}) = \mathrm{Var}(\bar{X}_{so}) + \frac{1}{n}\sum_{l=1}^{L} W_l\left(\sigma_l - \bar{\sigma}\right)^2$$

Therefore,

$$\frac{\mathrm{Var}(\bar{X}_{sp})}{\mathrm{Var}(\bar{X}_{so})} = 1 + \frac{\frac{1}{n}\sum W_l(\sigma_l - \bar{\sigma})^2}{\mathrm{Var}(\bar{X}_{so})} = 1 + \frac{\sum W_l(\sigma_l - \bar{\sigma})^2}{\left(\sum W_l\sigma_l\right)^2} = 1 + .218$$

Thus, under proportional allocation, the variance of the mean is about 20% larger
than it is under optimal allocation. ■
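The ratio can be recomputed from the stratum sizes and standard deviations. The sketch below uses the exact weights $N_l/N$ rather than the rounded table entries, so its result may differ from the figure in the text in the third decimal place; the variable names are illustrative.

```python
# Hospital strata A-D: sizes and standard deviations from Example A of Section 7.5.2.
N_l = [98, 98, 98, 99]
sigma = [103.4, 204.8, 243.5, 419.2]

N = sum(N_l)
W = [Nl / N for Nl in N_l]

# Weighted mean of the stratum standard deviations.
sigma_bar = sum(w * s for w, s in zip(W, sigma))

# Weighted spread of the stratum standard deviations around sigma_bar.
spread = sum(w * (s - sigma_bar) ** 2 for w, s in zip(W, sigma))

# Var(proportional) / Var(optimal), neglecting the finite population correction.
ratio = 1 + spread / sigma_bar**2
print(round(ratio, 2))  # about 1.22
```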


We can also compare the variance under simple random sampling with the variance
under proportional allocation. The variance under simple random sampling is,
neglecting the finite population correction,

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$$

In order to compare this equation with that for the variance under proportional
allocation, we need a relationship between the overall population variance, $\sigma^2$, and the
strata variances, $\sigma_l^2$. The overall population variance may be expressed as

$$\sigma^2 = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N_l}\left(x_{il} - \mu\right)^2$$

Also,

$$\left(x_{il} - \mu\right)^2 = \left[(x_{il} - \mu_l) + (\mu_l - \mu)\right]^2 = (x_{il} - \mu_l)^2 + 2(x_{il} - \mu_l)(\mu_l - \mu) + (\mu_l - \mu)^2$$

When both sides of this last equation are summed over $i$, the middle term on the
right-hand side becomes zero since $N_l\mu_l = \sum_{i=1}^{N_l} x_{il}$, so we have

$$\sum_{i=1}^{N_l}\left(x_{il} - \mu\right)^2 = \sum_{i=1}^{N_l}\left(x_{il} - \mu_l\right)^2 + N_l(\mu_l - \mu)^2 = N_l\sigma_l^2 + N_l(\mu_l - \mu)^2$$

Dividing both sides by $N$ and summing over $l$, we have

$$\sigma^2 = \sum_{l=1}^{L} W_l\sigma_l^2 + \sum_{l=1}^{L} W_l(\mu_l - \mu)^2$$

Substituting this expression for $\sigma^2$ into $\mathrm{Var}(\bar{X}) = \sigma^2/n$ and using the formula for
$\mathrm{Var}(\bar{X}_{sp})$ given in Theorem B completes a proof of the following theorem.


THEOREM D 


The difference between the variance of the mean of a simple random sample and
the variance of the mean of a stratified random sample based on proportional
allocation is, neglecting the finite population correction,

$$\mathrm{Var}(\bar{X}) - \mathrm{Var}(\bar{X}_{sp}) = \frac{1}{n}\sum_{l=1}^{L} W_l(\mu_l - \mu)^2$$

■
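The decomposition of the population variance into within-stratum and between-stratum parts, and hence the difference in Theorem D, can be checked on a small example. The population below is made up purely for illustration, and the helper names are hypothetical.

```python
import math

# Hypothetical toy population split into three strata (values are made up).
strata = [[1.0, 2.0, 3.0], [10.0, 12.0], [20.0, 21.0, 22.0, 25.0, 27.0]]


def pop_mean(values):
    return sum(values) / len(values)


def pop_var(values):
    """Population variance with divisor N, as used in this chapter."""
    m = pop_mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


population = [v for stratum in strata for v in stratum]
N = len(population)
W = [len(stratum) / N for stratum in strata]
mu = pop_mean(population)

within = sum(w * pop_var(s) for w, s in zip(W, strata))
between = sum(w * (pop_mean(s) - mu) ** 2 for w, s in zip(W, strata))

n = 5  # nominal sample size; finite population correction neglected
var_srs = pop_var(population) / n
var_sp = within / n

print(math.isclose(pop_var(population), within + between))
print(math.isclose(var_srs - var_sp, between / n))
```

Both comparisons hold for any partition of any finite population, since the identity is purely algebraic.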


Thus, stratified random sampling with proportional allocation always gives a 
smaller variance than does simple random sampling, providing that the finite popu- 
lation correction is ignored. Comparing the equations for the variances under simple 
random sampling, proportional allocation, and optimal allocation, we see that strat- 
ification with proportional allocation is better than simple random sampling if the 
strata means are quite variable and that stratification with optimal allocation is even 
better than stratification with proportional allocation if the strata standard deviations 
are variable. 


EXAMPLE C   We calculate the improvement that would result from using stratification with proportional
allocation rather than simple random sampling for the population of hospitals.
From Theorems B and D, we have

$$\frac{\mathrm{Var}(\bar{X}_{srs})}{\mathrm{Var}(\bar{X}_{sp})} = 1 + \frac{\sum W_l(\mu_l - \mu)^2}{\sum W_l\sigma_l^2} = 1 + 3.83$$

As is frequently the case, the gain from using stratification with proportional allocation
rather than simple random sampling is much greater than the gain from using optimal
allocation rather than proportional allocation. Furthermore, proportional allocation
requires knowledge only of the sizes of the strata, whereas optimal allocation requires
knowledge of the standard deviations of the strata, and such knowledge is usually
unavailable. ■
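This ratio can also be recomputed from the stratum summaries. The sketch below uses the exact weights $N_l/N$ and takes the overall mean to be the weighted mean of the rounded stratum means, so the result agrees with the figure above only approximately; the variable names are illustrative.

```python
# Hospital strata A-D: sizes, means, and standard deviations from the table.
N_l = [98, 98, 98, 99]
mu_l = [182.9, 526.5, 956.3, 1591.2]
sigma_l = [103.4, 204.8, 243.5, 419.2]

N = sum(N_l)
W = [Nl / N for Nl in N_l]
mu = sum(w * m for w, m in zip(W, mu_l))

between = sum(w * (m - mu) ** 2 for w, m in zip(W, mu_l))  # variation of stratum means
within = sum(w * s**2 for w, s in zip(W, sigma_l))         # average within-stratum variance

# Var(simple random sampling) / Var(proportional allocation), no fpc.
ratio = 1 + between / within
print(round(ratio, 1))  # about 4.8
```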


Typically, stratified random sampling can result in substantial increases in preci- 
sion for populations containing values that vary greatly in size. For example, a pop- 
ulation of transactions, a sample of which is to be audited for errors, might contain 
transactions in the hundreds of thousands of dollars and transactions in the hundreds 
of dollars. If such a population were divided into several strata according to the dollar 
amounts of the transactions, there might well be considerable variation in the mean 
transaction errors between the strata, since there may be rather large errors on large
transactions and small errors on small transactions. The variability of the errors might
also be larger in the former strata.

We have not addressed the question of how many strata to form and how to 
define the strata. In order to construct the optimal number of strata, the population 
values themselves, which are of course unknown, would have to be used. Stratification 
must therefore be done on the basis of some related variable that is known (such as 
transaction amount in the preceding paragraph) or on the results of earlier samples. 
In practice, it usually turns out that such relationships are not strong enough to make 
it worthwhile constructing more than a few strata. 


Concluding Remarks 


This chapter introduced survey sampling. It first covered the most elementary method 
of probability sampling—simple random sampling. The theory of this method under- 
lies the theory of more complex sampling techniques. Stratified sampling was also in- 
troduced and shown to increase the precision of estimates substantially in many cases. 

Several concepts and techniques introduced here recur throughout statistics: the 
concept of a random estimate of a population parameter, such as the population mean; 
bias; the standard error of an estimate; confidence intervals based on the central limit 
theorem; and linearization, or propagation of error. 

The theory and technique of survey sampling go far beyond the material in 
this introduction. One method that deserves mention because of its widespread use 
is systematic sampling. The population members are given in a list. If, say, a 10% 
sample is desired, every tenth member of the list is sampled starting from some random 
point among the first ten. If the list is in totally random order, this method is similar 
to simple random sampling. If, however, there is some correlation or relationship 
between successive members, the method is more similar to stratified sampling. The 
clear danger of this method is that there may be some periodic structure in the list, in 
which case bias can ensue. 
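The procedure just described is simple to express in code. The following is a minimal sketch, assuming the population is available as a Python list; the function name `systematic_sample` and the optional `rng` argument are illustrative choices.

```python
import random


def systematic_sample(population, k, rng=random):
    """Take every k-th member, starting at a random position among the first k."""
    start = rng.randrange(k)  # random integer in 0, 1, ..., k - 1
    return population[start::k]


# A 10% sample from a list of 100 members.
members = list(range(1, 101))
sample = systematic_sample(members, 10, random.Random(0))
print(sample)
```

Only the starting point is random, so there are just $k$ possible samples. This is why any periodic structure in the list with period related to $k$ can distort the estimate.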

Another commonly used method is cluster sampling. In sampling residential 
households, a survey might choose blocks randomly and then either sample every 
dwelling on each chosen block or further subsample the dwellings. Because one 
would expect dwellings within a single block to be relatively homogeneous, this 
method is typically less precise than a simple random sample of the same size. 

We have developed a mathematical model for survey sampling and have deduced 
consequences of that model, including probabilistic error bounds for the estimates. 
As is always the case, reality never quite matches the mathematical model. The 
basic assumptions of the model are (1) that every population member appears in 
the sample with a specified probability and (2) that an exact measurement or response 
is obtained from every sample member. In practice, neither assumption will hold pre- 
cisely. Converse and Traugott (1986) provide an interesting discussion of the practical 
difficulties of polls and surveys and consequences for the variability of the estimates. 

The first assumption may fail because of the difficulty of obtaining an ex- 
act enumeration of the population or because of imprecision in its definition. For 
example, political surveys can be putatively based on all adults, all registered voters,
or all "likely" voters. However, the most serious problem with respect to the first
assumption is that of nonresponse. Response levels of only 60% to 70% are common
in surveys of human populations. The possibility of substantial bias clearly arises if 
there is a relationship of potential answers to survey questions to the propensity to 
respond to those questions. For example, adults living in families are easier to contact 
by a telephone survey than those living alone, and the opinions of these two groups 
may well differ on certain issues. It is important to realize that the standard errors 
of estimates that we have developed earlier in this chapter account only for random 
variability in sample composition, not for systematic biases. 

The Literary Digest poll of 1936, which predicted a 57% to 43% victory for
Republican Alfred Landon over incumbent president Franklin Roosevelt, is one of 
the most famous of flawed surveys. Questionnaires were mailed to about 10 million 
voters, who were selected from lists such as telephone books and club memberships, 
and approximately 2.4 million of the questionnaires were returned. There were two 
intrinsic problems: (1) nonresponse—those who did not respond may have voted dif- 
ferently from those who did—and (2) selection bias—even if all 10 million voters 
had responded, they would not have constituted a random sample; those in lower 
socioeconomic classes (who were more likely to vote for Roosevelt) were less likely 
to have telephone service or belong to clubs and thus less likely to be included in 
the sample than were wealthier voters. The assumption that an exact measurement is 
obtained from every member of the sample may also be in error. In surveys conducted 
by interviewers, the interviewer's approach and personality may affect the response. 
In surveys that use questionnaires, the wording of the questions and the context within 
which they are lodged can have an effect. An interesting example is a poll conducted 
by Stanley Presser (New Yorker, Oct 18, 2004). Half of the sample was asked, “Do
you think the United States should allow public speeches against democracy?" The 
other half was asked, “Do you think the United States should forbid public speeches 
against democracy?" 56% said no to the first question, and 39% said yes to the second. 
The interesting paper by Hansen in Tanur et al. (1972) reports on efforts of the U.S. 
Bureau of the Census to investigate these sorts of problems. 


Problems 


1. Consider a population consisting of five values—1, 2, 2, 4, and 8. Find the 
population mean and variance. Calculate the sampling distribution of the mean 
of a sample of size 2 by generating all possible such samples. From them, find 
the mean and variance of the sampling distribution, and compare the results to 
Theorems A and B in Section 7.3.1. 


2. Suppose that a sample of size n = 2 is drawn from the population of the preceding 
problem and that the proportion of the sample values that are greater than 3 is 
recorded. Find the sampling distribution of this statistic by listing all possible 
such samples. Find the mean and variance of the sampling distribution. 


3. Which of the following is a random variable?

a. The population mean
b. The population size, $N$
c. The sample size, $n$
d. The sample mean
e. The variance of the sample mean
f. The largest value in the sample
g. The population variance
h. The estimated variance of the sample mean


4. Two populations are surveyed with simple random samples. A sample of size $n_1$
is used for population I, which has a population standard deviation $\sigma_1$; a sample of
size $n_2 = 2n_1$ is used for population II, which has a population standard deviation
$\sigma_2 = 2\sigma_1$. Ignoring finite population corrections, in which of the two samples
would you expect the estimate of the population mean to be more accurate?


5. How would you respond to a friend who asks you, “How can we say that the
sample mean is a random variable when it is just a number, like the population
mean? For example, in Example A of Section 7.3.2, a simple random sample
of size 50 produced $\bar{x} = 938.5$; how can the number 938.5 be a random
variable?”


6. Suppose that two populations have equal population variances but are of different
sizes: $N_1 = 100{,}000$ and $N_2 = 10{,}000{,}000$. Compare the variances of the sample
means for a sample of size $n = 25$. Is it substantially easier to estimate the mean
of the smaller population?


7. Suppose that a simple random sample is used to estimate the proportion of families
in a certain area that are living below the poverty level. If this proportion is roughly
.15, what sample size is necessary so that the standard error of the estimate is .02?


8. A sample of size $n = 100$ is taken from a population that has a proportion
$p = 1/5$.

a. Find $\delta$ such that $P(|\hat{p} - p| \ge \delta) = 0.025$.

b. If, in the sample, $\hat{p} = 0.25$, will the 95% confidence interval for $p$ contain
the true value of $p$?


9. In a simple random sample of 1,500 voters, 55% said they planned to vote for a
particular proposition, and 45% said they planned to vote against it. The estimated
margin of victory for the proposition is thus 10%. What is the standard error of
this estimated margin? What is an approximate 95% confidence interval for the
margin?


10. True or false (and state why):
If a sample from a population is large, a histogram of the values in the sample
will be approximately normal, even if the population is not normal.


11. Consider a population of size four, the members of which have values $x_1, x_2, x_3, x_4$.

a. If simple random sampling were used, how many samples of size two are
there?

b. Suppose that rather than simple random sampling, the following sampling
scheme is used. The possible samples of size two are

$$\{x_1, x_2\}, \{x_2, x_3\}, \{x_3, x_4\}, \{x_1, x_4\}$$

and the sampling is done in such a way that each of these four possible samples
is equally likely. Is the sample mean unbiased?

12. Consider simple random sampling with replacement.

a. Show that

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

is an unbiased estimate of $\sigma^2$.
b. Is $s$ an unbiased estimate of $\sigma$?
c. Show that $n^{-1}s^2$ is an unbiased estimate of $\sigma_{\bar{X}}^2$.
d. Show that $n^{-1}N^2s^2$ is an unbiased estimate of $\sigma_T^2$.
e. Show that $\hat{p}(1 - \hat{p})/(n - 1)$ is an unbiased estimate of $\sigma_{\hat{p}}^2$.


13. Suppose that the total number of discharges, $\tau$, in Example A of Section 7.2 is
estimated from a simple random sample of size 50. Denoting the estimate by $T$,
use the central limit theorem to sketch the approximate probability density of the
error $T - \tau$.


14. The proportion of hospitals in Example A of Section 7.2 that had fewer than 1000
discharges is $p = .654$. Suppose that the total number of hospitals having fewer
than 1000 discharges is estimated from a simple random sample of size 25. Use
the central limit theorem to sketch the approximate sampling distribution of the
estimate.


15. Consider estimating the mean of the population of hospital discharges (Example A
of Section 7.2) from a simple random sample of size $n$. Use the normal
approximation to the distribution of $\bar{X}$ in answering the following:

a. Sketch $P(|\bar{X} - \mu| > 200)$ as a function of $n$ for $20 \le n \le 100$.
b. For $n = 20$, 40, and 80, find $\Delta$ such that $P(|\bar{X} - \mu| > \Delta) \approx .10$. Similarly,
find $\Delta$ such that $P(|\bar{X} - \mu| > \Delta) \approx .50$.


16. True or false?

a. The center of a 95% confidence interval for the population mean is a random
variable.

b. A 95% confidence interval for $\mu$ contains the sample mean with probability
.95.

c. A 95% confidence interval contains 95% of the population.

d. Out of one hundred 95% confidence intervals for $\mu$, 95 will contain $\mu$.


17. A 90% confidence interval for the average number of children per household
based on a simple random sample is found to be (.7, 2.1). Can we conclude that
90% of households have between .7 and 2.1 children?


18. From independent surveys of two populations, 90% confidence intervals for the
population means are constructed. What is the probability that neither interval
contains the respective population mean? That both do?


19. This problem introduces the concept of a one-sided confidence interval. Using
the central limit theorem, how should the constant $k$ be chosen so that the interval
$(-\infty, \bar{X} + k s_{\bar{X}})$ is a 90% confidence interval for $\mu$, i.e., so that
$P(\mu \le \bar{X} + k s_{\bar{X}}) = .9$? This is called a one-sided confidence interval. How should $k$ be
chosen so that $(\bar{X} - k s_{\bar{X}}, \infty)$ is a 95% one-sided confidence interval?


20. In Example D of Section 7.3.3, a 95% confidence interval for $\mu$ was found to be
(1.44, 1.76). Because $\mu$ is some fixed number, it either lies in this interval or it
doesn't, so it doesn't make any sense to claim that $P(1.44 \le \mu \le 1.76) = .95$.
What do we mean, then, by saying this is a “95% confidence interval?”


21. In order to halve the width of a 95% confidence interval for a mean, by what factor
should the sample size be increased? Ignore the finite population correction.


22. An investigator quantifies her uncertainty about the estimate of a population mean
by reporting $\bar{X} \pm s_{\bar{X}}$. What size confidence interval is this?


23. a. Show that the standard error of an estimated proportion is largest when $p = 1/2$.
b. Use this result and Corollary B of Section 7.3.2 to conclude that the
quantity

$$\frac{1}{2}\sqrt{\frac{N - n}{N(n - 1)}}$$

is a conservative estimate of the standard error of $\hat{p}$ no matter what the value
of $p$ may be.
c. Use the central limit theorem to conclude that the interval

$$\hat{p} \pm \sqrt{\frac{N - n}{N(n - 1)}}$$

contains $p$ with probability at least .95.


24. For a random sample of size $n$ from a population of size $N$, consider the following
as an estimate of $\mu$:

$$\bar{X}_c = \sum_{i=1}^{n} c_i X_i$$

where the $c_i$ are fixed numbers and $X_1, \ldots, X_n$ is the sample.

a. Find a condition on the $c_i$ such that the estimate is unbiased.
b. Show that the choice of $c_i$ that minimizes the variances of the estimate subject
to this condition is $c_i = 1/n$, where $i = 1, \ldots, n$.


25. Here is an alternative proof of Lemma B in Section 7.3.1. Consider a random
permutation $Y_1, Y_2, \ldots, Y_N$ of $x_1, x_2, \ldots, x_N$. Argue that the joint distribution of
any subcollection, $Y_{i_1}, \ldots, Y_{i_n}$, of the $Y_i$ is the same as that of a simple random
sample, $X_1, \ldots, X_n$. In particular,

$$\mathrm{Var}(Y_i) = \mathrm{Var}(X_k) = \sigma^2$$

and

$$\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(X_k, X_l) = \gamma$$

if $i \ne j$ and $k \ne l$. Since $Y_1 + Y_2 + \cdots + Y_N = \tau$,

$$\mathrm{Var}\left(\sum_{i=1}^{N} Y_i\right) = 0$$

(Why?) Express $\mathrm{Var}\left(\sum_{i=1}^{N} Y_i\right)$ in terms of $\sigma^2$ and the unknown covariance, $\gamma$.
Solve for $\gamma$, and conclude that

$$\mathrm{Cov}(X_i, X_j) = -\frac{\sigma^2}{N - 1}$$

for $i \ne j$.

26. This is another proof of Lemma B in Section 7.3.1. Let $U_i$ be a random variable
with $U_i = 1$ if the $i$th population member is in the sample and equal to 0
otherwise.

a. Show that the sample mean $\bar{X} = n^{-1}\sum_{i=1}^{N} U_i x_i$.

b. Show that $P(U_i = 1) = n/N$. Find $E(U_i)$, using the fact that $U_i$ is a Bernoulli
random variable.

c. What is the variance of the Bernoulli random variable $U_i$?

d. Noting that $U_iU_j$ is a Bernoulli random variable, find $E(U_iU_j)$, $i \ne j$. (Be
careful to take into account that the sample is drawn without replacement.)

e. Find $\mathrm{Cov}(U_i, U_j)$, $i \ne j$.

f. Using the representation of $\bar{X}$ above, find $\mathrm{Var}(\bar{X})$.


Suppose that the population size N is not known, but it is known that n < N. 
Show that the following procedure will generate a simple random sample of 
size n. Imagine that the population is arranged in a long list that you can read 
sequentially. 


a. Let the sample initially consist of the first n elements in the list.
b. For k = 1, 2, ..., as long as the end of the list has not been encountered:

   i. Read the (n + k)-th element in the list.
   ii. Place it in the sample with probability n/(n + k) and, if it is placed in the
   sample, randomly drop one of the existing sample members.
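
The procedure in steps (a) and (b) can be sketched in Python as follows. This is only an illustration, not a proof: the function name `reservoir_sample` and the `rng` argument are assumptions introduced here, and "randomly drop one of the existing sample members" is read as choosing the member to drop uniformly at random.

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Draw n items from an iterable whose length is not known in advance."""
    sample = []
    for count, item in enumerate(stream, start=1):
        if count <= n:
            # Step (a): the first n elements form the initial sample.
            sample.append(item)
        else:
            # Step (b): item is the (n + k)-th element of the list.
            k = count - n
            if rng.random() < n / (n + k):
                # Replace a uniformly chosen existing member with the new item.
                sample[rng.randrange(n)] = item
    return sample

if __name__ == "__main__":
    print(reservoir_sample(range(1, 1001), 5, random.Random(1)))
```

Because the list is read only once and only n items are stored, the total length N never has to be known.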


In surveys, it is difficult to obtain accurate answers to sensitive questions such as 
“Have you ever used heroin?" or “Have you ever cheated on an exam?" Warner 
(1965) introduced the method of randomized response to deal with such sit- 
uations. A respondent spins an arrow on a wheel or draws a ball from an urn 
containing balls of two colors to determine which of two statements to respond 
to: (1) “I have characteristic A,” or (2) “I do not have characteristic A.” The inter- 
viewer does not know which statement is being responded to but merely records 
a yes or a no. The hope is that an interviewee is more likely to answer truthfully
if he or she realizes that the interviewer does not know which statement is being 
responded to. Let R be the proportion of a sample answering Yes. Let p be the 
probability that statement 1 is responded to (p is known from the structure of 
the randomizing device), and let q be the proportion of the population that has 
characteristic A. Let r be the probability that a respondent answers Yes. 


a. Show that r = (2p − 1)q + (1 − p). [Hint: P(yes) = P(yes given question 1) ×
P(question 1) + P(yes given question 2) × P(question 2).]




b. If r were known, how could q be determined? 

c. Show that E(R) = r, and propose an estimate, Q, for q. Show that the estimate 
is unbiased. 

d. Ignoring the finite population correction, show that

$$\mathrm{Var}(R) = \frac{r(1 - r)}{n}$$

where n is the sample size.

e. Find an expression for Var(Q). 
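
The randomized response mechanism itself can be simulated to check the relation in part (a) numerically. The sketch below assumes that every respondent answers truthfully and that respondents are drawn independently (that is, the finite population correction is ignored); the function names are hypothetical.

```python
import random

def yes_probability(p, q):
    """Probability of a Yes answer, r = (2p - 1)q + (1 - p)."""
    return (2 * p - 1) * q + (1 - p)

def simulate_yes_fraction(p, q, n, rng):
    """Fraction R of n simulated respondents who answer Yes."""
    yes = 0
    for _ in range(n):
        has_a = rng.random() < q            # respondent has characteristic A
        statement_1 = rng.random() < p      # randomizing device picks statement 1
        # Statement 1: "I have A"; statement 2: "I do not have A".
        answer_yes = has_a if statement_1 else not has_a
        yes += answer_yes
    return yes / n

if __name__ == "__main__":
    rng = random.Random(2)
    print(yes_probability(0.7, 0.2), simulate_yes_fraction(0.7, 0.2, 50000, rng))
```

With a large number of simulated respondents, the observed fraction R should be close to r, which is what makes it possible to work back from R to an estimate of q.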


A variation of the method described in Problem 28 has been proposed. Instead 
of responding to statement 2, the respondent answers an unrelated question for 
which the probability of a “yes” response is known, for example, “Were you born 
in June?” 


a. Propose an estimate of q for this method. 
b. Show that the estimate is unbiased. 
c. Obtain an expression for the variance of the estimate. 


Compare the accuracies of the methods of Problems 28 and 29 by comparing their 
standard deviations. You may do this by substituting some plausible numerical 
values for p and q. 


Referring to Example D in Section 7.3.3, how large should the sample be in order 
that the 95% confidence interval for the total number of owners planning to sell 
will have a width of 500? 


Referring again to Example D in Section 7.3.3, suppose that a survey is done of 
another condominium project of 12,000 units. The sample size is 200, and the 
proportion planning to sell in this sample is .18. 


a. What is the standard error of this estimate? Give a 90% confidence interval. 

b. Suppose we use the notation $\hat{p}_1 = .12$ and $\hat{p}_2 = .18$ to refer to the proportions
in the two samples. Let $\hat{d} = \hat{p}_1 - \hat{p}_2$ be an estimate of the difference, d, of
the two population proportions $p_1$ and $p_2$. Using the fact that $\hat{p}_1$ and $\hat{p}_2$ are
independent random variables, find expressions for the variance and standard
error of $\hat{d}$.

c. Because $\hat{p}_1$ and $\hat{p}_2$ are approximately normally distributed, so is $\hat{d}$. Use this
fact to construct 99%, 95%, and 90% confidence intervals for d. Is there clear
evidence that $p_1$ is really different from $p_2$?


Two populations are independently surveyed using simple random samples of
size n, and two proportions, $p_1$ and $p_2$, are estimated. It is expected that both
population proportions are close to .5. What should the sample size be so that the
standard error of the difference, $\hat{p}_1 - \hat{p}_2$, will be less than .02?


In a survey of a very large population, the incidences of two health problems are 
to be estimated from the same sample. It is expected that the first problem will 
affect about 3% of the population and the second about 40%. Ignore the finite 
population correction in answering the following questions. 
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a. How large should the sample be in order for the standard errors of both esti- 
mates to be less than .01? What are the actual standard errors for this sample 
size? 

b. Suppose that instead of imposing the same limit on both standard errors, the 
investigator wants the standard error to be less than 10% of the true value in 
each case. What should the sample size be? 


A simple random sample of a population of size 2000 yields the following 
25 values: 


104 109 111 109 87 
86 80 119 88 122 
91 103 99 108 96 

104 98 98 83 107 
79 87 94 92 97 


a. Calculate an unbiased estimate of the population mean.
b. Calculate unbiased estimates of the population variance and $\mathrm{Var}(\bar{X})$.
c. Give approximate 95% confidence intervals for the population mean and total.


With simple random sampling, is $\bar{X}^2$ an unbiased estimate of $\mu^2$? If not, what is
the bias?


Two surveys were independently conducted to estimate a population mean, μ.
Denote the estimates and their standard errors by $\bar{X}_1$ and $\bar{X}_2$ and $\sigma_{\bar{X}_1}$ and $\sigma_{\bar{X}_2}$.
Assume that $\bar{X}_1$ and $\bar{X}_2$ are unbiased. For some α and β, the two estimates can
be combined to give a better estimator:

$$\bar{X} = \alpha \bar{X}_1 + \beta \bar{X}_2$$

a. Find the conditions on α and β that make the combined estimate unbiased.
b. What choice of α and β minimizes the variance, subject to the condition of
unbiasedness?


1 n 
Let X,,..., X, be a simple random sample. Show that — p» X ? is an unbiased 
n 
1 i=l 
estimate of — ? 
Suppose that of a population of N items, k are defective in some way. For exam- 
ple, the items might be documents, a small proportion of which are fraudulent. 
How large should a sample be so that with a specified probability it will contain 
at least one of the defective items? For example, if N = 10,000, k = 50, and 
p = .95, what should the sample size be? Such calculations are useful in planning 
sample sizes for acceptance sampling. 


This problem presents an algorithm for drawing a simple random sample from a 
population in a sequential manner. The members of the population are considered 
for inclusion in the sample one at a time in some prespecified order (for example, 
the order in which they are listed). The ith member of the population is included 




in the sample with probability

$$\frac{n - n_i}{N - i + 1}$$

where $n_i$ is the number of population members already in the sample before the
ith member is examined. Show that the sample selected in this way is in fact
a simple random sample; that is, show that every possible sample occurs with
probability

$$\frac{1}{\binom{N}{n}}$$

In accounting and auditing, the following sampling method is sometimes used to
estimate a population total. In estimating the value of an inventory, suppose that
a book value exists for each item and is readily accessible. For each item in the
sample, the difference D, audited value minus book value, is determined. The
inventory value is estimated by the sum of the book values of the population and
$N\bar{D}$, where N is the population size and $\bar{D}$ is the sample average of the differences.


a. Show that the estimate is unbiased.

b. Find an expression for the variance of the estimate.

c. Compare the expression obtained in part (b) to the variance of the usual
estimate, which is the product of N and the average audited value. Under what
circumstances would the proposed method be more accurate?

d. How could a ratio estimate be employed in this situation? Would there be any
advantage or disadvantage to using a ratio estimate rather than the proposed
method?


Show that the population correlation coefficient is less than or equal to 1 in 
absolute value. 


Suppose that for Example D in Section 7.3.3, the average number of occupants 
per condominium unit in the sample is 2.2 with a sample standard deviation of 
.7 and the sample correlation coefficient between the number of occupants and 
the number of motor vehicles is .85. Estimate the population ratio of the number 
of motor vehicles per occupant and its standard error. Find an approximate 95%
confidence interval for the estimate.


Show that

$$\frac{\mathrm{Var}(\bar{Y}_R)}{\mathrm{Var}(\bar{Y})} \approx 1 + \frac{C_x}{C_y}\left(\frac{C_x}{C_y} - 2\rho\right)$$

Sketch the graph of this ratio as a function of $C_x/C_y$.


In the population of hospitals, the correlation of the number of beds and the
number of discharges is ρ = .91 (Example D of Section 7.4). To see how $\mathrm{Var}(\bar{Y}_R)$
would be different if the correlation were different, plot $\mathrm{Var}(\bar{Y}_R)$ for n = 64 as
a function of ρ for −1 < ρ < 1.


Use the central limit theorem to sketch the approximate sampling distribution
of $\bar{Y}_R$ for n = 64 for the population of hospitals. Compare to the approximate
sampling distribution of $\bar{Y}$.


For the population of hospitals and a sample size of n = 64, find the approximate
bias of $\bar{Y}_R$ by applying Corollary B of Section 7.4 and compare it to the
approximate standard deviation of the estimate. Repeat for n = 128.


A simple random sample of 100 households located in a city recorded the number 
of people living in the household, X, and the weekly expenditure for food, Y. It 
is known that there are 100,000 households in the city. In the sample 


$$\sum X_i = 320$$

$$\sum Y_i = 10{,}000$$

$$\sum X_i^2 = 1250$$

$$\sum Y_i^2 = 1{,}100{,}000$$

$$\sum X_i Y_i = 36{,}000$$


Neglect the finite population correction in answering the following.


a. Estimate the ratio $r = \mu_y / \mu_x$.

b. Form an approximate 95% confidence interval for $\mu_y / \mu_x$.

c. Using only the data on Y, estimate the total weekly food expenditure, τ, for
households in the city and form a 90% confidence interval.


In a wildlife survey, an area of desert land was divided into 1000 squares, or 
“quadrats,” a simple random sample of 50 of which were surveyed. In each sur- 
veyed quadrat, the number of birds, Y, and the area covered by vegetation, X, 
were determined. It was found that 


$$\sum X_i = 3000$$

$$\sum Y_i = 150$$

$$\sum X_i^2 = 225{,}000$$

$$\sum Y_i^2 = 650$$

$$\sum X_i Y_i = 11{,}000$$


a. Estimate the ratio of the average number of birds per quadrat to the average 
vegetation cover per quadrat. 

b. Estimate the standard error of your estimate and find an approximate 90% 
confidence interval for the population average. 

c. Estimate the total number of birds and find an approximate 95% confidence 
interval for the population total. 

d. Suppose that from an aerial survey, the total area covered by vegetation could
easily be determined. How could this information be used to provide another
estimate of the number of birds? Would you expect this estimate to be better
than or worse than that found in part (c)?


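Hartley and Ross (1954) derived the following exact bound on the relative size
of the bias and standard error of a ratio estimate:

$$\frac{|E(R) - r|}{\sigma_R} \le \frac{\sigma_{\bar{X}}}{\mu_x} = \frac{\sigma_x}{\mu_x}\sqrt{\frac{1}{n}\left(1 - \frac{n - 1}{N - 1}\right)}$$

a. Derive this bound from the relation

$$\mathrm{Cov}(R, \bar{X}) = E\left(\frac{\bar{Y}}{\bar{X}}\,\bar{X}\right) - E\left(\frac{\bar{Y}}{\bar{X}}\right)E(\bar{X})$$

b. Apply the bound to Problem 43 using sample estimates in place of the given
population parameters.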


This problem introduces a technique called the "jackknife," originally proposed
by Quenouille (1956) for reducing bias. Many nonlinear estimates, including the
ratio estimator, have the property that

$$E(\hat{\theta}) = \theta + \frac{b_1}{n} + \frac{b_2}{n^2} + \cdots$$

where $\hat{\theta}$ is an estimate of θ. The jackknife forms an estimate $\hat{\theta}_J$, which has a
leading bias term of the order $n^{-2}$ rather than $n^{-1}$. Thus, for sufficiently large
n, the bias of $\hat{\theta}_J$ is substantially smaller than that of $\hat{\theta}$. The technique involves
splitting the sample into several subsamples, computing the estimate for each
subsample, and then combining the several estimates. The sample is split into p
groups of size m, where n = mp. For j = 1, ..., p, the estimate $\hat{\theta}_j$ is calculated
from the m(p − 1) observations left after the jth group has been deleted. From
the preceding expression,

$$E(\hat{\theta}_j) = \theta + \frac{b_1}{m(p - 1)} + \frac{b_2}{(m(p - 1))^2} + \cdots$$

Now, p "pseudovalues" are defined:

$$V_j = p\hat{\theta} - (p - 1)\hat{\theta}_j$$

The jackknife estimate, $\hat{\theta}_J$, is defined as the average of the pseudovalues:

$$\hat{\theta}_J = \frac{1}{p}\sum_{j=1}^{p} V_j$$

Show that the bias of $\hat{\theta}_J$ is of the order $n^{-2}$.
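
The delete-a-group construction can be sketched in Python. This is an illustration under stated assumptions: the pseudovalue is taken to be p times the full-sample estimate minus (p − 1) times the delete-one-group estimate (the standard definition), the groups are consecutive blocks of the data as given, and the function names are hypothetical.

```python
def ratio_estimate(pairs):
    """Ratio of sample totals, sum(y) / sum(x), for a list of (x, y) pairs."""
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    return sum_y / sum_x

def jackknife(data, p, estimator):
    """Average of p pseudovalues built from p groups of equal size m = n / p."""
    n = len(data)
    if n % p != 0:
        raise ValueError("the sample size must be a multiple of p")
    m = n // p
    theta_hat = estimator(data)
    pseudovalues = []
    for j in range(p):
        # Delete the j-th group and recompute the estimate from the rest.
        kept = data[: j * m] + data[(j + 1) * m :]
        theta_j = estimator(kept)
        pseudovalues.append(p * theta_hat - (p - 1) * theta_j)
    return sum(pseudovalues) / p

if __name__ == "__main__":
    pairs = [(2.0, 5.1), (3.0, 6.8), (1.5, 3.2), (4.0, 9.5), (2.5, 5.9), (3.5, 8.1)]
    print(ratio_estimate(pairs), jackknife(pairs, 3, ratio_estimate))
```

For an estimate that is linear in the observations, such as a sample mean, the jackknife returns the original estimate unchanged; it only alters estimates that are nonlinear, such as the ratio.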


A population consists of three strata with $N_1 = N_2 = 1000$ and $N_3 = 500$.
A stratified random sample with 10 observations in each stratum yields the
following data:

Stratum 1 94 99 106 106 101 102 122 104 07 97 
Stratum 2. 183 183 179 211 178 179 192 192 201 177 
Stratum 3 343 302 286 317 289 284 357 288 314 276 


Estimate the population mean and total and give a 90% confidence interval.


The following table (Cochran 1977) shows the stratification of all farms in a
county by farm size and the mean and standard deviation of the number of acres
of corn in each stratum.


Farm Size     N_l     μ_l     σ_l

0-40          394     5.4     8.3
41-80         461    16.3    13.3
81-120        391    24.3    15.1
121-160       334    34.5    19.8
161-200       169    42.1    24.5
201-240       113    50.1    26.0
241+          148    63.8    35.2


a. For a sample size of 100 farms, compute the sample sizes from each stratum
for proportional and optimal allocation, and compare them.

b. Calculate the variances of the sample mean for each allocation and compare
them to each other and to the variance of an estimate formed from simple
random sampling.

c. What are the population mean and variance?

d. Suppose that ten farms are sampled per stratum. What is $\mathrm{Var}(\bar{X}_s)$? How large
a simple random sample would have to be taken to attain the same variance?
Ignore the finite population correction.

e. Repeat part (d) using proportional allocation of the 70 samples.


a. Suppose that the cost of a survey is $C = C_0 + C_1 n$, where $C_0$ is a startup
cost and $C_1$ is the cost per observation. For a given cost C, find the
allocation $n_1, \ldots, n_L$ to L strata that is optimal in the sense that it
minimizes the variance of the estimate of the population mean subject to the cost
constraint.

b. Suppose that the cost of an observation varies from stratum to stratum; in
some strata the observations might be relatively cheap and in others relatively
expensive. The cost of a survey with an allocation $n_1, \ldots, n_L$ is

$$C = C_0 + \sum_{l=1}^{L} C_l n_l$$

For a fixed total cost C, what choice of $n_1, \ldots, n_L$ minimizes the variance?

c. Assuming that the cost function is as given in part (b), for a fixed variance,
find $n_l$ to minimize cost.


The designer of a sample survey stratifies a population into two strata, H and L. 
H contains 100,000 people, and L contains 500,000. He decides to allocate 100 
samples to stratum H and 200 to stratum L, taking a simple random sample in 
each stratum. 


a. How should the designer estimate the population mean? 

b. Suppose that the population standard deviation in stratum H is 20 and the 
standard deviation in stratum L is 10. What will be the standard error of his 
estimate? 

c. Would it be better to allocate 200 samples to stratum H and 100 to stratum L? 

d. Would it be better to use proportional allocation? 


How might stratification be used in each of the following sampling problems? 


a. A survey of household expenditures in a city. 

b. A survey to examine the lead concentration in the soil in a large plot of land. 

c. A survey to estimate the number of people who use elevators in a large building
with a single bank of elevators.

d. A survey of programs on a television station, taken to estimate the proportion
of time taken up by advertising on Monday through Friday from 6 P.M. until
10 P.M. Assume that 52 weeks of recorded broadcasts are available for analysis.


Consider stratifying the population of Problem 1 into two strata: (1, 2, 2) and (4, 
8). Assuming that one observation is taken from each stratum, find the sampling 
distribution of the estimate of the population mean and the mean and standard 
deviation of the sampling distribution. Compare to Theorems A and B in Section 
7.5.2 and the results of Problem 1. 


(Computer Exercise) Construct a population consisting of the integers from 1 to 
100. Simulate the sampling distribution of the sample mean of a sample of size 
12 by drawing 100 samples of size 12 and making a histogram of the results. 
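
One possible starting point for this simulation is sketched below in Python, using only the standard library. The function names are hypothetical, sampling is done without replacement, and the theoretical standard deviation uses the simple random sampling variance formula with the finite population correction; the histogram itself is left to whatever plotting tool is available.

```python
import random
import statistics

def simulate_sample_means(population, n, n_samples, rng):
    """Means of n_samples simple random samples (without replacement) of size n."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]

def theoretical_sd(population, n):
    """Standard deviation of the sample mean under simple random sampling."""
    N = len(population)
    sigma2 = statistics.pvariance(population)  # population variance, divisor N
    return (sigma2 / n * (1 - (n - 1) / (N - 1))) ** 0.5

if __name__ == "__main__":
    population = list(range(1, 101))
    means = simulate_sample_means(population, 12, 100, random.Random(4))
    print(statistics.fmean(means), statistics.stdev(means))
    print(theoretical_sd(population, 12))
```

Comparing the standard deviation of the simulated means with the theoretical value gives a quick check on the simulation before the histogram is drawn.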


(Computer Exercise) Continuing with Problem 58, divide the population into 
two strata of equal size, allocate six observations per stratum, and simulate 
the distribution of the stratified estimate of the population mean. Do the same 
thing with four strata. Compare the results to each other and to the results of 
Problem 58. 


A population consists of two strata, H and L, of sizes 100,000 and 500,000 and 
standard deviations 20 and 12, respectively. A stratified sample of size 100 is to 
be taken. 


a. Find the optimal allocation for estimating the population mean. 

b. Find the optimal allocation for estimating the difference of the means of the
strata, $\mu_H - \mu_L$.

The value of a population mean increases linearly through time: $\mu(t) = \alpha + \beta t$,
while the variance remains constant. Independent simple random samples of size
n are taken at times t = 1, 2, and 3.


a. Find conditions on $w_1$, $w_2$, and $w_3$ such that

$$\hat{\beta} = w_1 \bar{X}_1 + w_2 \bar{X}_2 + w_3 \bar{X}_3$$

is an unbiased estimate of the rate of change, β. Here $\bar{X}_i$ denotes the sample
mean at time $t_i$.

b. What values of the $w_i$ minimize the variance subject to the constraint that the
estimate is unbiased?


In Example B of Section 7.5.2, the standard error of $\bar{X}_s$ was estimated to be
$s_{\bar{X}_s} = 35.8$. How good is this estimate; that is, what is the actual standard error of $\bar{X}_s$?


(Open-ended) Monte Carlo evaluation of an integral was introduced in Example 
A of Section 5.2. Refer to that example for the following notation. Try to interpret 
that method from the point of view of survey sampling by considering an “infinite 
population” of numbers in the interval [0, 1], each population member x having 
a value f(x). Interpret $\hat{I}(f)$ as the mean of a simple random sample. What is
the standard error of $\hat{I}(f)$? How could it be estimated? How could a confidence
interval for $\hat{I}(f)$ be formed? Do you think that anything could be gained by
stratifying the “population?” For example, the strata could be the intervals [0, .5) 
and [.5, 1]. You might find it helpful to consider some examples. 


The value of an inventory is to be estimated by sampling. The items are stratified 
by book value in the following way: 





Stratum         N_l      μ_l     σ_l

$1000+           70     3000    1250
$200-1000       500      500     100
$1-200       10,000       90      30


a. What should the relative sampling fraction in each stratum be for proportional 
and for optimal allocation? Ignore the finite population correction. 

b. How do the variances under each type of allocation compare to each other and 
to the variance under simple random sampling? 


The disk file cancer contains values for breast cancer mortality from 1950 to 
1960 (y) and the adult white female population in 1960 (x) for 301 counties in 
North Carolina, South Carolina, and Georgia. 


a. Make a histogram of the population values for cancer mortality. 

b. What are the population mean and total cancer mortality? What are the pop- 
ulation variance and standard deviation? 

c. Simulate the sampling distribution of the mean of a sample of 25 observations 
of cancer mortality. 

d. Draw a simple random sample of size 25 and use it to estimate the mean and 
total cancer mortality. 

e. Estimate the population variance and standard deviation from the sample of 
part (d). 

f. Form 95% confidence intervals for the population mean and total from the 
sample of part (d). Do the intervals cover the population values? 

g. Repeat parts (d) through (f) for a sample of size 100. 

h. Suppose that the size of the total population of each county is known and that
this information is used to improve the cancer mortality estimates by forming
a ratio estimator. Do you think this will be effective? Why or why not?

i. Simulate the sampling distribution of ratio estimators of mean cancer
mortality based on a simple random sample of size 25. Compare this result to that
of part (c).

j. Draw a simple random sample of size 25 and estimate the population mean and
total cancer mortality by calculating ratio estimates. How do these estimates
compare to those formed in the usual way in part (d) from the same data?

k. Form confidence intervals about the estimates obtained in part (j).

l. Stratify the counties into four strata by population size. Randomly sample six
observations from each stratum and form estimates of the population mean
and total mortality.

m. Stratify the counties into four strata by population size. What are the
sampling fractions for proportional allocation and optimal allocation? Compare
the variances of the estimates of the population mean obtained using simple
random sampling, proportional allocation, and optimal allocation.

n. How much better than those in part (m) will the estimates of the population
mean be if 8, 16, 32, or 64 strata are used instead?


A photograph of a large crowd on a beach is taken from a helicopter. The photo 
is of such high resolution that when sections are magnified, individual people 
can be identified, but to count the entire crowd in this way would be very time- 
consuming. Devise a plan to estimate the number of people on the beach by using 
a sampling procedure. 


The data set families contains information about 43,886 families living in 
the city of Cyberville. The city has four regions: the Northern region has 10,149 
families, the Eastern region has 10,390 families, the Southern region has 13,457 
families, and the Western region has 9,890. For each family, the following infor- 
mation is recorded: 


1. Family type
   1: Husband-wife family
   2: Male-head family
   3: Female-head family

2. Number of persons in family

3. Number of children in family

4. Family income

5. Region
   1: North
   2: East
   3: South
   4: West

6. Education level of head of household
   31: Less than 1st grade
   32: 1st, 2nd, 3rd, or 4th grade
   33: 5th or 6th grade
   34: 7th or 8th grade
   35: 9th grade
   36: 10th grade
   37: 11th grade
   38: 12th grade, no diploma
   39: High school graduate, high school diploma, or equivalent
   40: Some college but no degree
   41: Associate degree in college (occupation/vocation program)
   42: Associate degree in college (academic program)
   43: Bachelor's degree (e.g., B.S., B.A., A.B.)
   44: Master's degree (e.g., M.S., M.A., M.B.A.)
   45: Professional school degree (e.g., M.D., D.D.S., D.V.M., LL.B., J.D.)
   46: Doctoral degree (e.g., Ph.D., Ed.D.)


In these exercises, you will try to learn about the families of Cyberville by using
sampling.


a. Take a simple random sample of 500 families. Estimate the following
population parameters, calculate the estimated standard errors of these estimates, and
form 95% confidence intervals:

   i. The proportion of female-headed families
   ii. The average number of children per family
   iii. The proportion of heads of households who did not receive a high school
   diploma
   iv. The average family income

Repeat the preceding parameters for five different simple random samples of
size 500 and compare the results.

b. Take 100 samples of size 400.

   i. For each sample, find the average family income.
   ii. Find the average and standard deviation of these 100 estimates and make
   a histogram of the estimates.
   iii. Superimpose a plot of a normal density with that mean and standard
   deviation of the histogram and comment on how well it appears to fit.
   iv. Plot the empirical cumulative distribution function (see Section 10.2). On
   this plot, superimpose the normal cumulative distribution function with
   mean and standard deviation as earlier. Comment on the fit.
   v. Another method for examining a normal approximation is via a normal
   probability plot (Section 9.9). Make such a plot and comment on what it
   shows about the approximation.
   vi. For each of the 100 samples, find a 95% confidence interval for the
   population average income. How many of those intervals actually contain the
   population target?
   vii. Take 100 samples of size 100. Compare the averages, standard deviations,
   and histograms to those obtained for a sample of size 400 and explain how
   the theory of simple random sampling relates to the comparisons.

c. For a simple random sample of 500, compare the incomes of the three family
types by comparing histograms and boxplots (see Section 10.6).

d. Take simple random samples of size 400 from each of the four regions.

   i. Compare the incomes by region by making parallel boxplots.
   ii. Does it appear that some regions have larger families than others?
   iii. Are there differences in education level among the four regions?

e. Formulate a question of your choice and attempt to answer it with a simple
random sample of size 400.

f. Does stratification help in estimating the average family income? From a simple
random sample of size 400, estimate the average income and also the standard
error of your estimate. Form a 95% confidence interval. Next, allocate the 400
observations proportionally to the four regions and estimate the average income
from the stratified sample. Estimate the standard error and form a 95%
confidence interval. Compare your results to the results of the simple random sample.


CHAPTER 8


Estimation of Parameters
and Fitting of Probability
Distributions


8.1 Introduction


In this chapter, we discuss fitting probability laws to data. Many families of probability
laws depend on a small number of parameters; for example, the Poisson family
depends on the parameter λ (the mean number of counts), and the Gaussian family
depends on two parameters, μ and σ. Unless the values of parameters are known in
advance, they must be estimated from data in order to fit the probability law.

After parameter values have been chosen, the model should be compared to the
actual data to see if the fit is reasonable; Chapter 9 is concerned with measures and
tests of goodness of fit.

In order to introduce and illustrate some of the ideas and to provide a concrete
basis for later theoretical discussions, we will first consider a classical example: the
fitting of a Poisson distribution to radioactive decay. The concepts introduced in this
example will be elaborated in this and the next chapter.


8.2 Fitting the Poisson Distribution to Emissions
of Alpha Particles


Records of emissions of alpha particles from radioactive sources show that the
number of emissions per unit of time is not constant but fluctuates in a seemingly random
fashion. If the underlying rate of emission is constant over the period of observation
(which will be the case if the half-life is much longer than the time period of
observation) and if the particles come from a very large number of independent sources
(atoms), the Poisson model seems appropriate. For this reason, the Poisson
distribution is frequently used as a model for radioactive decay. You should recall that the
Poisson distribution as a model for random counts in space or time rests on three
assumptions: (1) the underlying rate at which the events occur is constant in space
or time, (2) events in disjoint intervals of space or time occur independently, and (3)
there are no multiple events.

Berkson (1966) conducted a careful analysis of data obtained from the National 
Bureau of Standards. The source of the alpha particles was americium 241. The 
experimenters recorded 10,220 times between successive emissions. The observed 
mean emission rate (total number of emissions divided by total time) was .8392 
emissions per sec. The clock used to record the times was accurate to .0002 sec. 

The first two columns of the following table display the counts, n, that were 
observed in 1207 intervals, each of length 10 sec. In 18 of the 1207 intervals, there 
were 0, 1, or 2 counts; in 28 of the intervals there were 3 counts, etc. 











n        Observed    Expected

0-2          18        12.2
3            28        27.0
4            56        56.5
5           105        94.9
6           126       132.7
7           146       159.1
8           164       166.9
9           161       155.6
10          123       130.6
11          101        99.7
12           74        69.7
13           53        45.0
14           23        27.0
15           15        15.1
16            9         7.9
17+           5         7.1

Total      1207      1207


In fitting a Poisson distribution to the counts shown in the table, we view the
1207 counts as 1207 independent realizations of Poisson random variables, each of
which has the probability mass function

$$\pi_k = P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

In order to fit the Poisson distribution, we must estimate a value for λ from the
observed data. Since the average count in a 10-second interval was 8.392, we take
this as an estimate of λ (recall that E(X) = λ) and denote it by $\hat{\lambda}$.
Before continuing, we want to mention some issues that will be explored in
depth in subsequent sections of this chapter. First, observe that if the experiment
were to be repeated, the counts would be different and the estimate of λ would be
different; it is thus appropriate to regard the estimate of λ as a random variable which
has a probability distribution referred to as its sampling distribution. The situation
is entirely analogous to tossing a coin 10 times and regarding the number of heads
as a binomially distributed random variable. Doing so and observing 6 heads generates
one realization of this random variable; in the same sense 8.392 is a realization of a
random variable. The question thus arises: what is the sampling distribution? This is
of some practical interest, since the spread of the sampling distribution reflects the
variability of the estimate. We could ask crudely, to what decimal place is the estimate
8.392 accurate? Second, later in this chapter we will discuss the rationale for choosing
to estimate λ as we have done. Although estimating λ as the observed mean count is
quite reasonable on its face, in principle there might be better procedures.
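
The idea of a sampling distribution can be made concrete with a small simulation. The Python sketch below assumes, purely for illustration, that the counts really are Poisson with mean 8.392, repeats the experiment of 1207 intervals many times, and records the mean count each time. The function names are hypothetical, and the Poisson draws use Knuth's multiplication method because the standard library has no Poisson generator.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """One Poisson(lam) variate by Knuth's method (adequate for moderate lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_estimates(lam, n_intervals, n_reps, rng):
    """Mean count in each of n_reps simulated repetitions of the experiment."""
    estimates = []
    for _ in range(n_reps):
        total = sum(poisson_draw(lam, rng) for _ in range(n_intervals))
        estimates.append(total / n_intervals)
    return estimates

if __name__ == "__main__":
    estimates = simulate_estimates(8.392, 1207, 200, random.Random(5))
    print(statistics.fmean(estimates), statistics.stdev(estimates))
    print(math.sqrt(8.392 / 1207))  # standard deviation of a mean of 1207 counts
```

The spread of the simulated estimates gives a rough answer to the question of how many decimal places of 8.392 are meaningful.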

We now turn to assessing goodness of fit, a subject that will be taken up in depth
in the next chapter. Consider the 16 cells into which the counts are grouped. Under
the hypothesized model, the probability that a random count falls in any one of the
cells may be calculated from the Poisson probability law. The probability that an
observation falls in the first cell (0, 1, or 2 counts) is

$$p_1 = \pi_0 + \pi_1 + \pi_2$$

The probability that an observation falls in the second cell is $p_2 = \pi_3$. The probability
that an observation falls in the 16th cell is

$$p_{16} = \sum_{k=17}^{\infty} \pi_k$$

Under the assumption that $X_1, \ldots, X_{1207}$ are independent Poisson random variables,
the number of observations out of 1207 falling in a given cell follows a binomial
distribution with a mean, or expected value, of $1207p_k$, and the joint distribution of the
counts in all the cells is multinomial with $n = 1207$ and probabilities $p_1, p_2, \ldots, p_{16}$.
The third column of the preceding table gives the expected number of counts in each
cell; for example, because $p_4 = .0786$, the expected count in the corresponding cell
is $1207 \times .0786 = 94.9$. Qualitatively, there is good agreement between the expected
and observed counts. Quantitative measures will be presented in Chapter 9.
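
The cell probabilities and expected counts described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the layout of the cells (0 to 2 counts, then 3, 4, ..., 16 individually, then 17 or more) is read off the description in the text, and the estimate 8.392 is taken from the discussion above.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability pi_k = lam^k e^(-lam) / k!, computed on the log scale."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def cell_probabilities(lam):
    """Probabilities of the 16 cells: {0,1,2}, {3}, {4}, ..., {16}, {17, 18, ...}."""
    first = sum(poisson_pmf(k, lam) for k in range(3))    # p_1 = pi_0 + pi_1 + pi_2
    middle = [poisson_pmf(k, lam) for k in range(3, 17)]  # p_2, ..., p_15
    last = 1.0 - first - sum(middle)                      # p_16 = P(X >= 17)
    return [first] + middle + [last]

n = 1207          # number of observations
lam_hat = 8.392   # estimate of lambda quoted in the text
p = cell_probabilities(lam_hat)
expected = [n * pk for pk in p]

for j, (pk, ek) in enumerate(zip(p, expected), start=1):
    print(f"cell {j:2d}: p = {pk:.4f}, expected count = {ek:6.1f}")
```

The fourth line of output corresponds to the worked figure in the text ($p_4 = .0786$, expected count 94.9).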


Parameter Estimation 


As was illustrated in the example of alpha particle emissions, in order to fit a probabil- 
ity law to data, one typically has to estimate parameters associated with the probability 
law from the data. The following examples further illustrate this point. 


EXAMPLE A Normal Distribution
The normal, or Gaussian, distribution involves two parameters, $\mu$ and $\sigma$, where $\mu$ is
the mean of the distribution and $\sigma^2$ is the variance:

$$f(x|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}, \qquad -\infty < x < \infty$$

FIGURE 8.1 Gaussian fit of current flow across a cell membrane to a frequency
polygon.


The use of the normal distribution as a model is usually justified using some 
version of the central limit theorem, which says that the sum of a large number of 
independent random variables is approximately normally distributed. For example, 
Bevan, Kullberg, and Rice (1979) studied random fluctuations of current across a 
muscle cell membrane. The cell membrane contained a large number of channels, 
which opened and closed at random and were assumed to operate independently. The 
net current resulted from ions flowing through open channels and was therefore the 
sum of a large number of roughly independent currents. As the channels opened and 
closed, the net current fluctuated randomly. Figure 8.1 shows a smoothed histogram 
of values obtained from 49,152 observations of the net current and an approximat- 
ing Gaussian curve. The fit of the Gaussian distribution is quite good, although the 
smoothed histogram seems to show a slight skewness. In this application, information
about the characteristics of the individual channels, such as conductance, was
extracted from the estimated parameters $\mu$ and $\sigma^2$.


EXAMPLE B Gamma Distribution
The gamma distribution depends on two parameters, $\alpha$ and $\lambda$:

$$f(x|\alpha, \lambda) = \frac{1}{\Gamma(\alpha)} \lambda^{\alpha} x^{\alpha-1} e^{-\lambda x}, \qquad 0 \le x < \infty$$

The family of gamma distributions provides a flexible set of densities for nonnegative
random variables.

Figure 8.2 shows how the gamma distribution fits the amounts of rainfall
from different storms (Le Cam and Neyman 1967). Gamma distributions were fit
to rainfall amounts from storms that were seeded and unseeded in an experiment to
determine the effects, if any, of seeding. Differences in the distributions between the
seeded and unseeded conditions should be reflected in differences in the parameters
$\alpha$ and $\lambda$.

FIGURE 8.2 Fit of gamma densities to amounts of rainfall for (a) seeded and
(b) unseeded storms.


As these examples illustrate, there are a variety of reasons for fitting probability 
laws to data. A scientific theory may suggest the form of a probability distribution 
and the parameters of that distribution may be of direct interest to the scientific inves- 
tigation; the examples of alpha particle emission and Example A are of this character. 
Example B is typical of situations in which a model is fit for essentially descriptive 
purposes as a method of data summary or compression. A probability model may 
play a role in a complex modeling situation; for example, utility companies interested 
in projecting patterns of consumer demand find it useful to model daily temperatures 
as random variables from a distribution of a particular form. This distribution may 
then be used in simulations of the effects of various pricing and generation schemes. 
In a similar way, hydrologists planning uses of water resources use stochastic models 
of rainfall in simulations. 


We will take the following basic approach to the study of parameter estimation.
The observed data will be regarded as realizations of random variables $X_1, X_2, \ldots, X_n$,
whose joint distribution depends on an unknown parameter $\theta$. Note that $\theta$ may be a
vector, such as $(\alpha, \lambda)$ in Example B. Usually the $X_i$ will be modeled as independent
random variables all having the same distribution $f(x|\theta)$, in which case their joint
distribution is $f(x_1|\theta) f(x_2|\theta) \cdots f(x_n|\theta)$. We will refer to such $X_i$ as independent and
identically distributed, or i.i.d. An estimate of $\theta$ will be a function of $X_1, X_2, \ldots, X_n$
and will hence be a random variable with a probability distribution called its sampling
distribution. We will use approximations to the sampling distribution to assess the
variability of the estimate, most frequently through its standard deviation, which is
commonly called its standard error.

It is desirable to have general procedures for forming estimates so that each new
problem does not have to be approached ab initio. We will develop two such procedures,
the method of moments and the method of maximum likelihood, concentrating
primarily on the latter, because it is the more generally useful.

The advanced theory of statistics is heavily concerned with “optimal estimation,” 
and we will touch lightly on this topic. The essential idea is that given a choice of many 
different estimation procedures, we would like to use that estimate whose sampling 
distribution is most concentrated around the true parameter value. 

Before going on to the method of moments, let us note that there are strong 
similarities of the subject matter of this and the previous chapter. In Chapter 7 we were 
concerned with estimating population parameters, such as the mean and total, and the 
process of random sampling created random variables whose probability distributions 
depended on those parameters. We were concerned with the sampling distributions 
of the estimates and with assessing variability via standard errors and confidence 
intervals. In this chapter we consider models in which the data are generated from a 
probability distribution. This distribution usually has a more hypothetical status than 
that of Chapter 7, where the distribution was induced by deliberate randomization. In 
this chapter we will also be concerned with sampling distributions and with assessing 
variability through standard errors and confidence intervals. 


8.4 The Method of Moments


The kth moment of a probability law is defined as

$$\mu_k = E(X^k)$$

where $X$ is a random variable following that probability law (of course, this is defined
only if the expectation exists). If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables from that
distribution, the kth sample moment is defined as

$$\hat\mu_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k$$

We can view $\hat\mu_k$ as an estimate of $\mu_k$. The method of moments estimates parameters
by finding expressions for them in terms of the lowest possible order moments and
then substituting sample moments into the expressions.


Suppose, for example, that we wish to estimate two parameters, $\theta_1$ and $\theta_2$. If $\theta_1$
and $\theta_2$ can be expressed in terms of the first two moments as

$$\theta_1 = f_1(\mu_1, \mu_2)$$
$$\theta_2 = f_2(\mu_1, \mu_2)$$

then the method of moments estimates are

$$\hat\theta_1 = f_1(\hat\mu_1, \hat\mu_2)$$
$$\hat\theta_2 = f_2(\hat\mu_1, \hat\mu_2)$$





The construction of a method of moments estimate involves three basic steps: 


1. Calculate low order moments, finding expressions for the moments in terms of the 
parameters. Typically, the number of low order moments needed will be the same 
as the number of parameters. 

2. Invert the expressions found in the preceding step, finding new expressions for the 
parameters in terms of the moments. 

3. Insert the sample moments into the expressions obtained in the second step, thus 
obtaining estimates of the parameters in terms of the sample moments. 


To illustrate this procedure, we consider some examples. 


EXAMPLE A Poisson Distribution
The first moment for the Poisson distribution is the parameter $\lambda = E(X)$. The first
sample moment is

$$\hat\mu_1 = \bar X = \frac{1}{n} \sum_{i=1}^{n} X_i$$

which is, therefore, the method of moments estimate of $\lambda$: $\hat\lambda = \bar X$.

As a concrete example, let us consider a study done at the National Institute of
Standards and Technology (Steel et al. 1980). Asbestos fibers on filters were counted
as part of a project to develop measurement standards for asbestos concentration. 
Asbestos dissolved in water was spread on a filter, and 3-mm diameter punches were 
taken from the filter and mounted on a transmission electron microscope. An operator 
counted the number of fibers in each of 23 grid squares, yielding the following counts: 


31 29 19 18 31 28 
34 27 34 30 16 18 
26 27 27 18 24 22 
28 24 21 17 24 


The Poisson distribution would be a plausible model for describing the variability
from grid square to grid square in this situation and could be used to characterize the
inherent variability in future measurements. The method of moments estimate of $\lambda$ is
simply the arithmetic mean of the counts listed above, or $\hat\lambda = 24.9$.

If the experiment were to be repeated, the counts—and therefore the estimate— 
would not be exactly the same. It is thus natural to ask how stable this estimate is. 




Chapter 8 Estimation of Parameters and Fitting of Probability Distributions 


A standard statistical technique for addressing this question is to derive the sampling
distribution of the estimate or an approximation to that distribution. The statistical
model stipulates that the individual counts $X_i$ are independent Poisson random
variables with parameter $\lambda_0$. Letting $S = \sum_{i=1}^{n} X_i$, the parameter estimate $\hat\lambda = S/n$ is a
random variable, the distribution of which is called its sampling distribution. Now
from Example E in Section 4.5, the sum of independent Poisson
random variables is Poisson distributed, so the distribution of $S$ is Poisson($n\lambda_0$).
Thus the probability mass function of $\hat\lambda$ is

$$P(\hat\lambda = v) = P(S = nv) = \frac{(n\lambda_0)^{nv} e^{-n\lambda_0}}{(nv)!}$$

for $v$ such that $nv$ is a nonnegative integer.
Since $S$ is Poisson, its mean and variance are both $n\lambda_0$, so

$$E(\hat\lambda) = \frac{1}{n} E(S) = \lambda_0$$
$$\text{Var}(\hat\lambda) = \frac{1}{n^2} \text{Var}(S) = \frac{\lambda_0}{n}$$

From Example A in Section 5.3, if $\lambda_0$ is large, the distribution of $S$ is approximately
normal; hence, that of $\hat\lambda$ is approximately normal as well, with mean and variance
given above. Because $E(\hat\lambda) = \lambda_0$, we say that the estimate is unbiased: the sampling
distribution is centered at $\lambda_0$. The second equation shows that the sampling distribution
becomes more concentrated about $\lambda_0$ as $n$ increases. The standard deviation of this
distribution is called the standard error of $\hat\lambda$ and is

$$\sigma_{\hat\lambda} = \sqrt{\frac{\lambda_0}{n}}$$

Of course, we can't know the sampling distribution or the standard error of $\hat\lambda$ because
they depend on $\lambda_0$, which is unknown. However, we can derive an approximation by
substituting $\hat\lambda$ for $\lambda_0$ and use it to assess the variability of our estimate. In particular,
we can calculate the estimated standard error of $\hat\lambda$ as

$$s_{\hat\lambda} = \sqrt{\frac{\hat\lambda}{n}}$$

For this example, we find

$$s_{\hat\lambda} = \sqrt{\frac{24.9}{23}} = 1.04$$

At the end of this section, we will present a justification for using $\hat\lambda$ in place of $\lambda_0$.

In summary, we have found that the sampling distribution of $\hat\lambda$ is approximately
normal, centered at the true value $\lambda_0$ with standard deviation 1.04. This gives us
a reasonable assessment of the variability of our parameter estimate. For example,
because a normally distributed random variable is unlikely to be more than two
standard deviations away from its mean, the error in our estimate of $\lambda$ is unlikely to
be more than 2.08. We thus have not only an estimate of $\lambda_0$, but also an understanding
of the inherent variability of that estimate.
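
The arithmetic of this example is easy to reproduce. The following is a minimal sketch using the 23 asbestos counts listed above; the variable names are illustrative and not part of the original analysis.

```python
import math

# Fiber counts in the 23 grid squares, as listed in the text.
counts = [31, 29, 19, 18, 31, 28,
          34, 27, 34, 30, 16, 18,
          26, 27, 27, 18, 24, 22,
          28, 24, 21, 17, 24]

n = len(counts)
lam_hat = sum(counts) / n           # method of moments estimate: the sample mean
se_hat = math.sqrt(lam_hat / n)     # estimated standard error: sqrt(lam_hat / n)

print(f"n = {n}")
print(f"lambda_hat = {lam_hat:.1f}")
print(f"estimated standard error = {se_hat:.2f}")
print(f"two standard errors = {2 * se_hat:.2f}")
```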


In Chapter 9, we will address the question of whether the Poisson distribution
really fits these data. Clearly, we could calculate the average of any batch of numbers,
whether or not they were well fit by the Poisson distribution.


EXAMPLE B Normal Distribution
The first and second moments for the normal distribution are

$$\mu_1 = E(X) = \mu$$
$$\mu_2 = E(X^2) = \mu^2 + \sigma^2$$

Therefore,

$$\mu = \mu_1$$
$$\sigma^2 = \mu_2 - \mu_1^2$$

The corresponding estimates of $\mu$ and $\sigma^2$ from the sample moments are

$$\hat\mu = \bar X$$
$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar X^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2$$

From Section 6.3, the sampling distribution of $\bar X$ is $N(\mu, \sigma^2/n)$ and $n\hat\sigma^2/\sigma^2 \sim \chi^2_{n-1}$.
Furthermore, $\bar X$ and $\hat\sigma^2$ are independently distributed. We will return to these
sampling distributions later in the chapter.


EXAMPLE C Gamma Distribution
The first two moments of the gamma distribution are

$$\mu_1 = \frac{\alpha}{\lambda}$$
$$\mu_2 = \frac{\alpha(\alpha + 1)}{\lambda^2}$$

(see Example B in Section 4.5). To apply the method of moments, we must express
$\alpha$ and $\lambda$ in terms of $\mu_1$ and $\mu_2$. From the second equation,

$$\mu_2 = \mu_1^2 + \frac{\mu_1}{\lambda}$$

or

$$\lambda = \frac{\mu_1}{\mu_2 - \mu_1^2}$$

Also, from the equation for the first moment given here,

$$\alpha = \lambda\mu_1 = \frac{\mu_1^2}{\mu_2 - \mu_1^2}$$

The method of moments estimates are, since $\hat\sigma^2 = \hat\mu_2 - \hat\mu_1^2$,

$$\hat\lambda = \frac{\bar X}{\hat\sigma^2}$$

and

$$\hat\alpha = \frac{\bar X^2}{\hat\sigma^2}$$

As a concrete example, let us consider the fit of the amounts of precipitation
during 227 storms in Illinois from 1960 to 1964 to a gamma distribution (Le Cam and
Neyman 1967). The data, listed in Problem 42 at the end of Chapter 10, were gathered
and analyzed in an attempt to characterize the natural variability in precipitation
from storm to storm. A histogram shows that the distribution is quite skewed, so a
gamma distribution is a natural candidate for a model. For these data, $\bar X = .224$ and
$\hat\sigma^2 = .1338$, and therefore $\hat\alpha = .375$ and $\hat\lambda = 1.674$.

FIGURE 8.3 Gamma densities fit by the methods of moments and by the method of
maximum likelihood to amounts of precipitation; the solid line shows the method of
moments estimate and the dotted line the maximum likelihood estimate.

The histogram with the fitted density is shown in Figure 8.3. Note that, in order
to make visual comparison easy, the density was normalized to have a total area equal
to the total area under the histogram, which is the number of observations times the
bin width of the histogram, or $227 \times .2 = 45.4$. Alternatively, the histogram could
have been normalized to have a total area of 1. Qualitatively, the fit in Figure 8.3 looks
reasonable; we will examine it in more detail in Example C in Section 9.9.
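
The three steps of the method of moments for the gamma distribution can be sketched as follows. The function names are hypothetical; the first function works from raw data, and the second reproduces the arithmetic of this example from the two summary statistics quoted in the text (the raw precipitation data are not reproduced here).

```python
import statistics

def gamma_mom_from_summary(xbar, sigma2_hat):
    """Method of moments estimates (alpha_hat, lambda_hat) for a gamma distribution,
    given the sample mean and the sample variance with divisor n."""
    lambda_hat = xbar / sigma2_hat
    alpha_hat = xbar ** 2 / sigma2_hat
    return alpha_hat, lambda_hat

def gamma_mom(xs):
    """Same estimates computed from a list of observations."""
    xbar = statistics.fmean(xs)
    sigma2_hat = statistics.fmean([(x - xbar) ** 2 for x in xs])  # mu2_hat - mu1_hat^2
    return gamma_mom_from_summary(xbar, sigma2_hat)

# Summary statistics quoted in the text for the 227 storms.
alpha_hat, lambda_hat = gamma_mom_from_summary(0.224, 0.1338)
print(f"alpha_hat = {alpha_hat:.3f}, lambda_hat = {lambda_hat:.3f}")
```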


We now turn to a discussion of the sampling distributions of $\hat\alpha$ and $\hat\lambda$. In the previous
two examples, we were able to use known theoretical results in deriving sampling
distributions, but it appears that it would be difficult to derive the exact forms of the
sampling distributions of $\hat\lambda$ and $\hat\alpha$, because they are each rather complicated functions
of the sample values $X_1, X_2, \ldots, X_n$. However, the problem can be approached by
simulation. Imagine for the moment that we knew the true values $\lambda_0$ and $\alpha_0$. We could
generate many, many samples of size $n = 227$ from the gamma distribution with
these parameter values, and from each of these samples we could calculate estimates
of $\lambda$ and $\alpha$. A histogram of the values of the estimates of $\lambda$, for example, should then
give us a good idea of the sampling distribution of $\hat\lambda$.

The only problem with this idea is that it requires knowing the true parameter
values. (Notice that we faced a problem very much like this in Example A.) So we
substitute our estimates of $\lambda$ and $\alpha$ for the true values; that is, we draw many, many
samples of size $n = 227$ from a gamma distribution with parameters $\alpha = .375$ and
$\lambda = 1.674$. The results of drawing 1000 such samples of size $n = 227$ are displayed
in Figure 8.4. Figure 8.4(a) is a histogram of the 1000 estimates of $\alpha$ so obtained and
Figure 8.4(b) shows the corresponding histogram for $\lambda$. These histograms indicate the
variability that is inherent in estimating the parameters from a sample of this size. For
example, we see that if the true value of $\alpha$ is .375, then it would not be very unusual
for the estimate to be in error by .1 or more. Notice that the shapes of the histograms
suggest that they might be approximated by normal densities.

The variability shown by the histograms can be summarized by calculating the
standard deviations of the 1000 estimates, thus providing estimated standard errors of
$\hat\alpha$ and $\hat\lambda$. To be precise, if the 1000 estimates of $\alpha$ are denoted by $\alpha_i^*$, $i = 1, 2, \ldots, 1000$,
the standard error of $\hat\alpha$ is estimated as

$$s_{\hat\alpha} = \sqrt{\frac{1}{1000} \sum_{i=1}^{1000} (\alpha_i^* - \bar\alpha)^2}$$

where $\bar\alpha$ is the mean of the 1000 values. The results of this calculation and the
corresponding one for $\hat\lambda$ are $s_{\hat\alpha} = .06$ and $s_{\hat\lambda} = .34$. These standard errors are concise
quantifications of the amount of variability of the estimates $\hat\alpha = .375$ and $\hat\lambda = 1.674$
displayed in Figure 8.4.

FIGURE 8.4 Histogram of 1000 simulated method of moment estimates of (a) $\alpha$
and (b) $\lambda$.


Our use of simulation (or Monte Carlo) here is an example of what in statistics 
is called the bootstrap. We will see more examples of this versatile method later. 
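
The simulation just described can be sketched with the Python standard library. This is an illustration under stated assumptions, not the computation behind Figure 8.4: the function names and the random seed are arbitrary, and because the random numbers differ, the resulting standard errors should only be close to the values quoted in the text. Note that `random.gammavariate` takes a shape and a scale parameter, so the scale is $1/\lambda$.

```python
import random
import statistics

def gamma_mom(xs):
    """Method of moments estimates (alpha_hat, lambda_hat) from a sample."""
    xbar = statistics.fmean(xs)
    s2 = statistics.fmean([(x - xbar) ** 2 for x in xs])
    return xbar * xbar / s2, xbar / s2

def bootstrap_gamma_mom(alpha_hat, lambda_hat, n, n_boot, rng):
    """Draw n_boot samples of size n from the fitted gamma distribution and
    re-estimate both parameters from each sample."""
    alphas, lambdas = [], []
    for _ in range(n_boot):
        sample = [rng.gammavariate(alpha_hat, 1.0 / lambda_hat) for _ in range(n)]
        a, lam = gamma_mom(sample)
        alphas.append(a)
        lambdas.append(lam)
    return alphas, lambdas

rng = random.Random(0)  # arbitrary seed, for reproducibility only
alphas, lambdas = bootstrap_gamma_mom(0.375, 1.674, n=227, n_boot=1000, rng=rng)

# Standard deviations with divisor 1000, matching the formula in the text.
se_alpha = statistics.pstdev(alphas)
se_lambda = statistics.pstdev(lambdas)
print(f"estimated standard error of alpha_hat:  {se_alpha:.3f}")
print(f"estimated standard error of lambda_hat: {se_lambda:.3f}")
```

Histograms of `alphas` and `lambdas` would play the role of the two panels of Figure 8.4.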


EXAMPLE D An Angular Distribution
The angle $\theta$ at which electrons are emitted in muon decay has a distribution with the
density

$$f(x|\alpha) = \frac{1 + \alpha x}{2}, \qquad -1 \le x \le 1 \text{ and } -1 \le \alpha \le 1$$

where $x = \cos\theta$. The parameter $\alpha$ is related to polarization. Physical considerations
dictate that $|\alpha| \le \frac{1}{3}$, but we note that $f(x|\alpha)$ is a probability density for $|\alpha| \le 1$.
The method of moments may be applied to estimate $\alpha$ from a sample of experimental
measurements, $X_1, \ldots, X_n$. The mean of the density is

$$\mu = \int_{-1}^{1} x \, \frac{1 + \alpha x}{2} \, dx = \frac{\alpha}{3}$$

Thus, the method of moments estimate of $\alpha$ is $\hat\alpha = 3\bar X$. Consideration of the sampling
distribution of $\hat\alpha$ is left as an exercise (Problem 13).


Under reasonable conditions, method of moments estimates have the desirable
property of consistency. An estimate, $\hat\theta$, is said to be a consistent estimate of a
parameter, $\theta$, if $\hat\theta$ approaches $\theta$ as the sample size approaches infinity. The following
states this more precisely.


DEFINITION


Let $\hat\theta_n$ be an estimate of a parameter $\theta$ based on a sample of size $n$. Then $\hat\theta_n$ is said
to be consistent in probability if $\hat\theta_n$ converges in probability to $\theta$ as $n$ approaches
infinity; that is, for any $\epsilon > 0$,

$$P(|\hat\theta_n - \theta| > \epsilon) \to 0 \quad \text{as } n \to \infty$$


The weak law of large numbers implies that the sample moments converge in 
probability to the population moments. If the functions relating the estimates to the 
sample moments are continuous, the estimates will converge to the parameters as the 
sample moments converge to the population moments. 
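
As a worked instance of the definition, here is a sketch of why the estimate $\hat\alpha = 3\bar X$ of Example D is consistent. It uses Chebyshev's inequality directly rather than the weak law of large numbers invoked above; the second moment $E(X^2) = 1/3$ is computed here from the density of Example D and is not stated in the text.

```latex
% Second moment and variance of a single observation
E(X^2) = \int_{-1}^{1} x^2 \, \frac{1 + \alpha x}{2} \, dx = \frac{1}{3},
\qquad
\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \frac{1}{3} - \frac{\alpha^2}{9}

% The estimate is unbiased, and its variance shrinks like 1/n
E(\hat\alpha) = 3E(\bar X) = \alpha,
\qquad
\mathrm{Var}(\hat\alpha) = 9\,\mathrm{Var}(\bar X) = \frac{9}{n}\left(\frac{1}{3} - \frac{\alpha^2}{9}\right) = \frac{3 - \alpha^2}{n}

% Chebyshev's inequality then gives consistency in probability
P(|\hat\alpha - \alpha| > \epsilon) \le \frac{\mathrm{Var}(\hat\alpha)}{\epsilon^2}
= \frac{3 - \alpha^2}{n\epsilon^2} \to 0 \quad \text{as } n \to \infty
```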

The consistency of method of moments estimates can be used to provide a
justification for a procedure that we used in estimating standard errors in the previous
examples. We were interested in the variance (or its square root, the standard error)
of a parameter estimate $\hat\theta$. Denoting the true parameter by $\theta_0$, we had a relationship
of the form

$$\sigma_{\hat\theta} = \frac{1}{\sqrt{n}} \sigma(\theta_0)$$

(In Example A, $\sigma_{\hat\lambda} = \sqrt{\lambda_0/n}$, so that $\sigma(\lambda) = \sqrt{\lambda}$.) We approximated this by the
estimated standard error

$$s_{\hat\theta} = \frac{1}{\sqrt{n}} \sigma(\hat\theta)$$

The consistency of $\hat\theta$ implies that $s_{\hat\theta} \approx \sigma_{\hat\theta}$. More precisely,

$$\lim_{n \to \infty} \frac{s_{\hat\theta}}{\sigma_{\hat\theta}} = 1$$

provided that the function $\sigma(\theta)$ is continuous in $\theta$. The result follows since if $\hat\theta \to \theta_0$,
then $\sigma(\hat\theta) \to \sigma(\theta_0)$. Of course, this is just a limiting result and we always have a
finite value of $n$ in practice, but it does provide some hope that the ratio will be close
to 1 and that the estimated standard error will be a reasonable indication of variability.

Let us summarize the results of this section. We have shown how the method 
of moments can provide estimates of the parameters of a probability distribution 
based on a “sample” (an i.i.d. collection) of random variables from that distribution. 
We addressed the question of variability or reliability of the estimates by observing 
that if the sample is random, the parameter estimates are random variables having 
distributions that are referred to as their sampling distributions. The standard deviation 
of the sampling distribution is called the standard error of the estimate. We then faced 
the problem of how to ascertain the variability of an estimate from the sample itself. 
In some cases the sampling distribution was of an explicit form depending upon 
the unknown parameters (Examples A and B); in these cases we could substitute 
our estimates for the unknown parameters in order to approximate the sampling 
distribution. In other cases the form of the sampling distribution was not so obvious, 
but we realized that even if we didn't know it explicitly, we could simulate it. By 
using the bootstrap we avoided doing perhaps difficult analytic calculations by sitting 
back and instructing a computer to generate random numbers. 


8.5 The Method of Maximum Likelihood


As well as being a useful tool for parameter estimation in our current context, the 
method of maximum likelihood can be applied to a great variety of other statistical 
problems, such as curve fitting, for example. This general utility is one of the major 
reasons for the importance of likelihood methods in statistics. We will later see that 
maximum likelihood estimates have nice theoretical properties as well. 


Suppose that random variables $X_1, \ldots, X_n$ have a joint density or frequency
function $f(x_1, x_2, \ldots, x_n|\theta)$. Given observed values $X_i = x_i$, where $i = 1, \ldots, n$,
the likelihood of $\theta$ as a function of $x_1, x_2, \ldots, x_n$ is defined as

$$\text{lik}(\theta) = f(x_1, x_2, \ldots, x_n|\theta)$$

Note that we consider the joint density as a function of $\theta$ rather than as a function of
the $x_i$. If the distribution is discrete, so that $f$ is a frequency function, the likelihood
function gives the probability of observing the given data as a function of the
parameter $\theta$. The maximum likelihood estimate (mle) of $\theta$ is that value of $\theta$ that
maximizes the likelihood; that is, it makes the observed data "most probable" or "most
likely."

If the $X_i$ are assumed to be i.i.d., their joint density is the product of the marginal
densities, and the likelihood is

$$\text{lik}(\theta) = \prod_{i=1}^{n} f(X_i|\theta)$$

Rather than maximizing the likelihood itself, it is usually easier to maximize its natural
logarithm (which is equivalent since the logarithm is a monotonic function). For an
i.i.d. sample, the log likelihood is

$$l(\theta) = \sum_{i=1}^{n} \log[f(X_i|\theta)]$$

(In this text, "log" will always mean the natural logarithm.)
Let us find the maximum likelihood estimates for the examples first considered
in Section 8.4.


EXAMPLE A Poisson Distribution
If $X$ follows a Poisson distribution with parameter $\lambda$, then

$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

If $X_1, \ldots, X_n$ are i.i.d. and Poisson, their joint frequency function is the product of
the marginal frequency functions. The log likelihood is thus

$$l(\lambda) = \sum_{i=1}^{n} (X_i \log\lambda - \lambda - \log X_i!) = \log\lambda \sum_{i=1}^{n} X_i - n\lambda - \sum_{i=1}^{n} \log X_i!$$

FIGURE 8.5 Plot of the log likelihood function of $\lambda$ for asbestos data.

Figure 8.5 is a graph of $l(\lambda)$ for the asbestos counts of Example A in Section 8.4.
Setting the first derivative of the log likelihood equal to zero, we find

$$l'(\lambda) = \frac{1}{\lambda} \sum_{i=1}^{n} X_i - n = 0$$

The mle is then

$$\hat\lambda = \bar X$$

We can check that this is indeed a maximum (in fact, $l(\lambda)$ is a concave function
of $\lambda$; see Figure 8.5). The maximum likelihood estimate agrees with the method of
moments for this case and thus has the same sampling distribution.
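
A numerical check of this result is straightforward. The sketch below evaluates the Poisson log likelihood for the asbestos counts of Section 8.4 on a grid of $\lambda$ values (the grid range and spacing are arbitrary choices made here) and confirms that the maximizing grid point sits next to the sample mean.

```python
import math

counts = [31, 29, 19, 18, 31, 28,
          34, 27, 34, 30, 16, 18,
          26, 27, 27, 18, 24, 22,
          28, 24, 21, 17, 24]

def poisson_log_likelihood(lam, xs):
    """l(lam) = sum of (x_i log lam - lam - log x_i!); lgamma(x + 1) = log x!."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

# Grid from 20.00 to 30.00 in steps of 0.01.
grid = [20.0 + 0.01 * i for i in range(1001)]
best = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))
xbar = sum(counts) / len(counts)

print(f"grid maximizer of the log likelihood: {best:.2f}")
print(f"sample mean:                          {xbar:.3f}")
```

Plotting `poisson_log_likelihood` over the grid would give a curve of the kind described for Figure 8.5.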


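EXAMPLE B Normal Distribution
If $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, their joint density is the product of their marginal
densities:

$$f(x_1, x_2, \ldots, x_n|\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left[\frac{x_i - \mu}{\sigma}\right]^2\right)$$

Regarded as a function of $\mu$ and $\sigma$, this is the likelihood function. The log likelihood
is thus

$$l(\mu, \sigma) = -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2$$

The partials with respect to $\mu$ and $\sigma$ are

$$\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \mu)$$

$$\frac{\partial l}{\partial\sigma} = -\frac{n}{\sigma} + \sigma^{-3} \sum_{i=1}^{n} (X_i - \mu)^2$$

Setting the first partial equal to zero and solving for the mle, we obtain

$$\hat\mu = \bar X$$

Setting the second partial equal to zero and substituting the mle for $\mu$, we find that
the mle for $\sigma$ is

$$\hat\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2}$$

Again, these estimates and their sampling distributions are the same as those obtained
by the method of moments.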
EXAMPLE C Gamma Distribution
Since the density function of a gamma distribution is

$$f(x|\alpha, \lambda) = \frac{1}{\Gamma(\alpha)} \lambda^{\alpha} x^{\alpha-1} e^{-\lambda x}, \qquad 0 \le x < \infty$$

the log likelihood of an i.i.d. sample, $X_1, \ldots, X_n$, is

$$l(\alpha, \lambda) = \sum_{i=1}^{n} [\alpha\log\lambda + (\alpha - 1)\log X_i - \lambda X_i - \log\Gamma(\alpha)]$$

$$= n\alpha\log\lambda + (\alpha - 1) \sum_{i=1}^{n} \log X_i - \lambda \sum_{i=1}^{n} X_i - n\log\Gamma(\alpha)$$

The partial derivatives are

$$\frac{\partial l}{\partial\alpha} = n\log\lambda + \sum_{i=1}^{n} \log X_i - n\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$$

$$\frac{\partial l}{\partial\lambda} = \frac{n\alpha}{\lambda} - \sum_{i=1}^{n} X_i$$

Setting the second partial equal to zero, we find

$$\hat\lambda = \frac{n\hat\alpha}{\sum_{i=1}^{n} X_i} = \frac{\hat\alpha}{\bar X}$$

But when this solution is substituted into the equation for the first partial, we obtain
a nonlinear equation for the mle of $\alpha$:

$$n\log\hat\alpha - n\log\bar X + \sum_{i=1}^{n} \log X_i - n\frac{\Gamma'(\hat\alpha)}{\Gamma(\hat\alpha)} = 0$$

This equation cannot be solved in closed form; an iterative method for finding the
roots has to be employed. To start the iterative procedure, we could use the initial
value obtained by the method of moments.

For this example, the two methods do not give the same estimates. The mle's
are computed from the precipitation data of Example C in Section 8.4 by an iterative
procedure (a combination of the secant method and the method of bisection) using the
method of moments estimates as starting values. The resulting estimates are $\hat\alpha = .441$
and $\hat\lambda = 1.96$. In Example C in Section 8.4, the method of moments estimates were
found to be $\hat\alpha = .375$ and $\hat\lambda = 1.674$. Figure 8.3 shows fitted densities from both
types of estimates of $\alpha$ and $\lambda$. There is clearly little practical difference, especially if
we keep in mind that the gamma distribution is only a possible model and should not
be taken as being literally true.
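
One way to solve the nonlinear equation for $\hat\alpha$ is sketched below. It is not the secant-and-bisection combination mentioned above: it uses plain bisection on the equivalent equation $\log\hat\alpha - \Gamma'(\hat\alpha)/\Gamma(\hat\alpha) = \log\bar X - \frac{1}{n}\sum \log X_i$, whose left side decreases in $\hat\alpha$. The `digamma` helper (a recurrence plus a standard asymptotic series) and all function names are assumptions of this sketch, and because the precipitation data are not reproduced here, the demonstration runs on a simulated sample.

```python
import math
import random
import statistics

def digamma(x):
    """Gamma'(x) / Gamma(x) for x > 0: shift x upward with the recurrence
    psi(x) = psi(x + 1) - 1/x, then apply an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result

def gamma_mle(xs, tol=1e-10):
    """Maximum likelihood estimates (alpha_hat, lambda_hat) for positive data xs."""
    xbar = statistics.fmean(xs)
    c = math.log(xbar) - statistics.fmean([math.log(x) for x in xs])  # positive unless all x equal

    def g(a):
        return math.log(a) - digamma(a) - c   # decreasing in a; its root is alpha_hat

    lo, hi = 1e-6, 1.0
    while g(hi) > 0:          # expand the bracket until the root is enclosed
        hi *= 2.0
    while hi - lo > tol * hi:  # bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    alpha_hat = 0.5 * (lo + hi)
    return alpha_hat, alpha_hat / xbar

# Simulated stand-in for the data: 227 gamma variates (shape .441, rate 1.96).
rng = random.Random(1)  # arbitrary seed
sample = [rng.gammavariate(0.441, 1.0 / 1.96) for _ in range(227)]
alpha_hat, lambda_hat = gamma_mle(sample)
print(f"alpha_hat = {alpha_hat:.3f}, lambda_hat = {lambda_hat:.3f}")
```

Bisection needs no starting value, so the method of moments estimate is not used here; with a secant or Newton iteration it would be the natural starting point, as the text suggests.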

Because the maximum likelihood estimates are not given in closed form,
obtaining their exact sampling distribution would appear to be intractable. We thus
use the bootstrap to approximate these distributions, just as we did to approximate
the sampling distributions of the method of moments estimates. The underlying
rationale is the same: If we knew the "true" values, $\alpha_0$ and $\lambda_0$, say, we could approximate
the sampling distribution of their maximum likelihood estimates by generating many,
many samples of size $n = 227$ from a gamma distribution with parameters $\alpha_0$ and
$\lambda_0$, forming the maximum likelihood estimates from each sample, and displaying the
results in histograms. Since, of course, we don't know the true values, we let our
maximum likelihood estimates play their role: We generated 1000 samples each of
size $n = 227$ of gamma distributed random variables with $\alpha = .441$ and $\lambda = 1.97$.
For each of these samples, the maximum likelihood estimates of $\alpha$ and $\lambda$ were
calculated. Histograms of these 1000 estimates are shown in Figure 8.6; we regard these
histograms as approximations to the sampling distribution of the maximum likelihood
estimates $\hat\alpha$ and $\hat\lambda$.

FIGURE 8.6 Histograms of 1000 simulated maximum likelihood estimates of (a) $\alpha$
and (b) $\lambda$.

Comparison of Figures 8.6 and 8.4 is interesting. We see that the sampling
distributions of the maximum likelihood estimates are substantially less dispersed than
those of the method of moments estimates, which indicates that in this situation, the
method of maximum likelihood is more precise than the method of moments. The
standard deviations of the values displayed in the histograms are the estimated
standard errors of the maximum likelihood estimates; we find $s_{\hat\alpha} = .03$ and $s_{\hat\lambda} = .26$.
Recall that in Example C of Section 8.4 the corresponding estimated standard errors
for the method of moments estimates were found to be .06 and .34.
EXAMPLE D Muon Decay

From the form of the density given in Example D in Section 8.4, the log likelihood is

$$l(\alpha) = \sum_{i=1}^{n} \log(1 + \alpha X_i) - n\log 2$$

Setting the derivative equal to zero, we see that the mle of $\alpha$ satisfies the following
nonlinear equation:

$$\sum_{i=1}^{n} \frac{X_i}{1 + \hat\alpha X_i} = 0$$

Again, we would have to use an iterative technique to solve for $\hat\alpha$. The method of
moments estimate could be used as a starting value.
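
The following sketch solves this equation numerically on simulated data. Everything here is an assumption made for illustration: the true value $\alpha = 0.3$, the sample size, the seed, and the function names are arbitrary; the sample is generated by inverting the distribution function $F(x) = (x + 1)/2 + \alpha(x^2 - 1)/4$, which follows from the density in Example D of Section 8.4; and bisection is used as the iterative technique because the left side of the equation decreases in $\hat\alpha$.

```python
import math
import random
import statistics

def sample_angular(alpha, rng):
    """Draw x from f(x|alpha) = (1 + alpha x)/2 on [-1, 1] by inverting the CDF."""
    u = rng.random()
    if abs(alpha) < 1e-12:
        return 2.0 * u - 1.0   # alpha = 0 gives the uniform distribution on [-1, 1]
    return (-1.0 + math.sqrt((1.0 - alpha) ** 2 + 4.0 * alpha * u)) / alpha

def score(alpha, xs):
    """Derivative of the log likelihood: sum of x_i / (1 + alpha x_i)."""
    return sum(x / (1.0 + alpha * x) for x in xs)

def angular_mle(xs, tol=1e-10):
    """Root of the score on (-1, 1) by bisection; the score decreases in alpha."""
    lo, hi = -1.0 + 1e-9, 1.0 - 1e-9
    if score(lo, xs) <= 0:
        return lo              # likelihood is maximized at the lower boundary
    if score(hi, xs) >= 0:
        return hi              # likelihood is maximized at the upper boundary
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, xs) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(2)  # arbitrary seed
xs = [sample_angular(0.3, rng) for _ in range(20000)]
alpha_mom = 3.0 * statistics.fmean(xs)   # method of moments estimate, 3 * sample mean
alpha_mle = angular_mle(xs)
print(f"method of moments estimate: {alpha_mom:.3f}")
print(f"maximum likelihood estimate: {alpha_mle:.3f}")
```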


In Examples C and D, in order to find the maximum likelihood estimate, we 
would have to solve a nonlinear equation. In general, in some problems involving 
several parameters, systems of nonlinear equations must be solved to find the mle’s. 
We will not discuss numerical methods here; a good discussion is found in Chapter 6 
of Dahlquist and Bjorck (1974). 


8.5.1 Maximum Likelihood Estimates of Multinomial
Cell Probabilities


The method of maximum likelihood is often applied to problems involving multinomial
cell probabilities. Suppose that $X_1, \ldots, X_m$, the counts in cells $1, \ldots, m$, follow
a multinomial distribution with a total count of $n$ and cell probabilities $p_1, \ldots, p_m$.
We wish to estimate the $p$'s from the $x$'s. The joint frequency function of $X_1, \ldots, X_m$
is

$$f(x_1, \ldots, x_m|p_1, \ldots, p_m) = \frac{n!}{\prod_{i=1}^{m} x_i!} \prod_{i=1}^{m} p_i^{x_i}$$

Note that the marginal distribution of each $X_i$ is binomial $(n, p_i)$, and that since
the $X_i$ are not independent (they are constrained to sum to $n$), their joint frequency
function is not the product of the marginal frequency functions, as it was in the
examples considered in the preceding section. We can, however, still use the method
of maximum likelihood since we can write an expression for the joint distribution.
We assume $n$ is given, and we wish to estimate $p_1, \ldots, p_m$ with the constraint that
the $p_i$ sum to 1. From the joint frequency function just given, the log likelihood is

$$l(p_1, \ldots, p_m) = \log n! - \sum_{i=1}^{m} \log x_i! + \sum_{i=1}^{m} x_i \log p_i$$

To maximize this likelihood subject to the constraint, we introduce a Lagrange
multiplier and maximize

$$L(p_1, \ldots, p_m, \lambda) = \log n! - \sum_{i=1}^{m} \log x_i! + \sum_{i=1}^{m} x_i \log p_i + \lambda\left(\sum_{i=1}^{m} p_i - 1\right)$$

Setting the partial derivatives equal to zero, we have the following system of
equations:

$$\hat p_j = -\frac{x_j}{\lambda}, \qquad j = 1, \ldots, m$$

Summing both sides of this equation, we have

$$1 = \frac{-n}{\lambda}$$

or

$$\lambda = -n$$

Therefore,

$$\hat p_j = \frac{x_j}{n}$$

which is an obvious set of estimates. The sampling distribution of $\hat p_j$ is determined
by the distribution of $x_j$, which is binomial.

In some situations, such as frequently occur in the study of genetics, the multinomial
cell probabilities are functions of other unknown parameters $\theta$; that is, $p_i = p_i(\theta)$.
In such cases, the log likelihood of $\theta$ is

$$l(\theta) = \log n! - \sum_{i=1}^{m} \log x_i! + \sum_{i=1}^{m} x_i \log p_i(\theta)$$


EXAMPLE A Hardy-Weinberg Equilibrium

If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur in a
population with frequencies $(1 - \theta)^2$, $2\theta(1 - \theta)$, and $\theta^2$, according to the Hardy-
Weinberg law. In a sample from the Chinese population of Hong Kong in 1937,
blood types occurred with the following frequencies, where M and N are erythrocyte
antigens:

                     Blood Type
             M      MN     N      Total
Frequency    342    500    187    1029

There are several possible ways to estimate $\theta$ from the observed frequencies. For
example, if we equate $\theta^2$ with 187/1029, we obtain .4263 as an estimate of $\theta$. Intuitively,
however, it seems that this procedure ignores some of the information in the other
cells. If we let $X_1$, $X_2$, and $X_3$ denote the counts in the three cells and let $n = 1029$,
the log likelihood of $\theta$ is (you should check this):

$$l(\theta) = \log n! - \sum_{i=1}^{3} \log X_i! + X_1\log(1 - \theta)^2 + X_2\log 2\theta(1 - \theta) + X_3\log\theta^2$$

$$= \log n! - \sum_{i=1}^{3} \log X_i! + (2X_1 + X_2)\log(1 - \theta) + (2X_3 + X_2)\log\theta + X_2\log 2$$

In maximizing $l(\theta)$, we do not need to explicitly incorporate the constraint that the cell
probabilities sum to 1 since the functional form of $p_i(\theta)$ is such that $\sum_{i=1}^{3} p_i(\theta) = 1$.
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Setting the derivative equal to zero, we have 


2Xı +X:  2X3+ X, 
1-0 0 i 





Solving this, we obtain the mle: 


2X4 + X 
2X, + ey 2X 
2X3 + X2 

2n 

2 x 187 + 500 
~ 2x 1029 


ô = 





= 4247 
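The closed-form estimate can be checked numerically. The following is a minimal sketch, not part of the original analysis: it evaluates the $\theta$-dependent part of the log likelihood on a grid (the grid spacing and the function name are our own choices) and confirms that the grid maximizer sits next to $(2X_3 + X_2)/(2n)$.

```python
import math

# Observed counts of blood types M, MN, N from the example.
x1, x2, x3 = 342, 500, 187
n = x1 + x2 + x3


def log_likelihood_kernel(theta):
    """Part of the log likelihood that depends on theta (constants dropped)."""
    return (2 * x1 + x2) * math.log(1 - theta) + (2 * x3 + x2) * math.log(theta)


# Closed-form mle from the derivation above.
theta_hat = (2 * x3 + x2) / (2 * n)

# Crude check: the best point on a fine grid should lie next to theta_hat.
grid = [k / 10000 for k in range(1, 10000)]
theta_grid = max(grid, key=log_likelihood_kernel)

print(round(theta_hat, 4), theta_grid)
```

A grid search is used here only because it is transparent; any one-dimensional optimizer would serve the same purpose.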


How precise is this estimate? Do we have faith in the accuracy of the first, second,
third, or fourth decimal place? We will address these questions by using the bootstrap
to estimate the sampling distribution and the standard error of $\hat\theta$. The bootstrap
logic is as follows: If $\theta$ were known, then the three multinomial cell probabilities,
$(1 - \theta)^2$, $2\theta(1 - \theta)$, and $\theta^2$, would be known. To find the sampling
distribution of $\hat\theta$, we could simulate many multinomial random variables with these
probabilities and $n = 1029$, and for each we could form an estimate of $\theta$. A histogram
of these estimates would be an approximation to the sampling distribution. Since, of course,
we don't know the actual value of $\theta$ to use in such a simulation, the bootstrap
principle tells us to use $\hat\theta = .4247$ in its place. With this estimated value of
$\theta$ the three cell probabilities (M, MN, N) are .331, .489, and .180. One thousand
multinomial random counts, each with total count 1029, were simulated with these
probabilities (see Problem 35 at the end of the chapter for the method of generating these
random counts). From each of these 1000 computer "experiments," a value $\theta^*$ was
determined. A histogram of the estimates (Figure 8.7) can be regarded as an estimate of the
sampling distribution of $\hat\theta$. The estimated standard error of $\hat\theta$ is the
standard deviation of these 1000 values: $s_{\hat\theta} = .011$. ■
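The bootstrap experiment just described can be sketched in a few lines. This is an illustration under our own assumptions rather than the computation behind Figure 8.7: the multinomial counts are drawn with the standard library's `random.choices` (not necessarily the method of Problem 35), and the seed and function names are arbitrary.

```python
import random
import statistics


def hw_mle(counts):
    """Hardy-Weinberg mle of theta from the counts (X1, X2, X3)."""
    x1, x2, x3 = counts
    return (2 * x3 + x2) / (2 * (x1 + x2 + x3))


def simulate_counts(theta, n, rng):
    """One multinomial draw of size n with the Hardy-Weinberg cell probabilities."""
    probs = [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]
    draws = rng.choices([0, 1, 2], weights=probs, k=n)
    return [draws.count(0), draws.count(1), draws.count(2)]


def bootstrap_estimates(theta_hat, n, n_boot=1000, seed=0):
    """Simulate n_boot data sets at theta_hat and re-estimate theta from each."""
    rng = random.Random(seed)
    return [hw_mle(simulate_counts(theta_hat, n, rng)) for _ in range(n_boot)]


theta_hat = hw_mle([342, 500, 187])
boot = bootstrap_estimates(theta_hat, n=1029)
se_boot = statistics.stdev(boot)  # bootstrap estimate of the standard error
print(round(se_boot, 3))
```

The standard deviation of the simulated estimates will vary a little from run to run and seed to seed, which is itself a reminder that the bootstrap standard error is an estimate.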


Large Sample Theory for Maximum Likelihood Estimates 


In this section we develop approximations to the sampling distribution of maximum 
likelihood estimates by using limiting arguments as the sample size increases. The 
theory we shall sketch shows that under reasonable conditions, maximum likelihood 
estimates are consistent. We also develop a useful and important approximation for 
the variance of a maximum likelihood estimate and argue that for large sample sizes,
the sampling distribution is approximately normal. 

The rigorous development of this large sample theory is quite technical; we will 
simply state some results and give very rough, heuristic arguments for the case of 
an i.i.d. sample and a one-dimensional parameter. (The arguments for Theorems A 
and B may be skipped without loss of continuity. Rigorous proofs may be found in 
Cramér (1946).)

FIGURE 8.7 Histogram of 1000 simulated maximum likelihood estimates of $\theta$
described in Example A.


For an i.i.d. sample of size $n$, the log likelihood is

$$l(\theta) = \sum_{i=1}^{n} \log f(x_i|\theta)$$

We denote the true value of $\theta$ by $\theta_0$. It can be shown that under reasonable
conditions $\hat\theta$ is a consistent estimate of $\theta_0$; that is, $\hat\theta$ converges
to $\theta_0$ in probability as $n$ approaches infinity.


THEOREM A 


Under appropriate smoothness conditions on f, the mle from an i.i.d. sample is 
consistent. 


Proof 


The following is merely a sketch of the proof. Consider maximizing

$$\frac{1}{n} l(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f(X_i|\theta)$$

As $n$ tends to infinity, the law of large numbers implies that

$$
\begin{aligned}
\frac{1}{n} l(\theta) &\to E \log f(X|\theta) \\
&= \int \log f(x|\theta) \, f(x|\theta_0) \, dx
\end{aligned}
$$

It is thus plausible that for large $n$, the $\theta$ that maximizes $l(\theta)$ should be
close to the $\theta$ that maximizes $E \log f(X|\theta)$. (An involved argument is necessary
to establish this.) To maximize $E \log f(X|\theta)$, we consider its derivative:

$$\frac{\partial}{\partial\theta} \int \log f(x|\theta) \, f(x|\theta_0) \, dx = \int \frac{\frac{\partial}{\partial\theta} f(x|\theta)}{f(x|\theta)} \, f(x|\theta_0) \, dx$$

If $\theta = \theta_0$, this equation becomes

$$\int \frac{\partial}{\partial\theta} f(x|\theta_0) \, dx = \frac{\partial}{\partial\theta} \int f(x|\theta_0) \, dx = \frac{\partial}{\partial\theta}(1) = 0$$

which shows that $\theta_0$ is a stationary point and hopefully a maximum. Note that
we have interchanged differentiation and integration and that the assumption of
smoothness on $f$ must be strong enough to justify this. ■


We will now derive a useful intermediate result. 


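LEMMA A

Define $I(\theta)$ by

$$I(\theta) = E\left[\frac{\partial}{\partial\theta} \log f(X|\theta)\right]^2$$

Under appropriate smoothness conditions on $f$, $I(\theta)$ may also be expressed as

$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \log f(X|\theta)\right]$$

Proof

First, we observe that since $\int f(x|\theta) \, dx = 1$,

$$\frac{\partial}{\partial\theta} \int f(x|\theta) \, dx = 0$$

Combining this with the identity

$$\frac{\partial}{\partial\theta} f(x|\theta) = \left[\frac{\partial}{\partial\theta} \log f(x|\theta)\right] f(x|\theta)$$

we have

$$0 = \frac{\partial}{\partial\theta} \int f(x|\theta) \, dx = \int \left[\frac{\partial}{\partial\theta} \log f(x|\theta)\right] f(x|\theta) \, dx$$

where we have interchanged differentiation and integration (some assumptions
must be made in order to do this). Taking second derivatives of the preceding
expressions, we have

$$
\begin{aligned}
0 &= \frac{\partial}{\partial\theta} \int \left[\frac{\partial}{\partial\theta} \log f(x|\theta)\right] f(x|\theta) \, dx \\
&= \int \left[\frac{\partial^2}{\partial\theta^2} \log f(x|\theta)\right] f(x|\theta) \, dx + \int \left[\frac{\partial}{\partial\theta} \log f(x|\theta)\right]^2 f(x|\theta) \, dx
\end{aligned}
$$

From this, the desired result follows. ■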


The large sample distribution of a maximum likelihood estimate is approximately
normal with mean $\theta_0$ and variance $1/[nI(\theta_0)]$. Since this is merely a limiting
result, which holds as the sample size tends to infinity, we say that the mle is
asymptotically unbiased and refer to the variance of the limiting normal distribution as the
asymptotic variance of the mle.


THEOREM B

Under smoothness conditions on $f$, the probability distribution of
$\sqrt{nI(\theta_0)}(\hat\theta - \theta_0)$ tends to a standard normal distribution.


Proof 


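The following is merely a sketch of the proof; the details of the argument are
beyond the scope of this book. From a Taylor series expansion,

$$0 = l'(\hat\theta) \approx l'(\theta_0) + (\hat\theta - \theta_0) \, l''(\theta_0)$$

$$(\hat\theta - \theta_0) \approx \frac{-l'(\theta_0)}{l''(\theta_0)}$$

$$n^{1/2}(\hat\theta - \theta_0) \approx \frac{-n^{-1/2} \, l'(\theta_0)}{n^{-1} \, l''(\theta_0)}$$

First, we consider the numerator of this last expression. Its expectation is

$$E[n^{-1/2} \, l'(\theta_0)] = n^{-1/2} \sum_{i=1}^{n} E\left[\frac{\partial}{\partial\theta} \log f(X_i|\theta_0)\right] = 0$$

as in Theorem A. Its variance is

$$\text{Var}[n^{-1/2} \, l'(\theta_0)] = \frac{1}{n} \sum_{i=1}^{n} E\left[\frac{\partial}{\partial\theta} \log f(X_i|\theta_0)\right]^2 = I(\theta_0)$$

Next, we consider the denominator:

$$\frac{1}{n} l''(\theta_0) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2} \log f(x_i|\theta_0)$$

By the law of large numbers, the latter expression converges to

$$E\left[\frac{\partial^2}{\partial\theta^2} \log f(X|\theta_0)\right] = -I(\theta_0)$$

from Lemma A.

We thus have

$$n^{1/2}(\hat\theta - \theta_0) \approx \frac{n^{-1/2} \, l'(\theta_0)}{I(\theta_0)}$$

Therefore,

$$E[n^{1/2}(\hat\theta - \theta_0)] \approx 0$$

Furthermore,

$$\text{Var}[n^{1/2}(\hat\theta - \theta_0)] \approx \frac{I(\theta_0)}{I^2(\theta_0)} = \frac{1}{I(\theta_0)}$$

and thus

$$\text{Var}(\hat\theta - \theta_0) \approx \frac{1}{nI(\theta_0)}$$

The central limit theorem may be applied to $l'(\theta_0)$, which is a sum of i.i.d.
random variables:

$$l'(\theta_0) = \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i|\theta_0)$$

■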


Another interpretation of the result of Theorem B is as follows. For an i.i.d. sample,
the maximum likelihood estimate is the maximizer of the log likelihood function,

$$l(\theta) = \sum_{i=1}^{n} \log f(X_i|\theta)$$

The asymptotic variance is

$$\frac{1}{nI(\theta_0)} = -\frac{1}{E \, l''(\theta_0)}$$

When $-E \, l''(\theta_0)$ is large, $l(\theta)$ is, on average, changing very rapidly in a
vicinity of $\theta_0$ and the variance of the maximizer is small.
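Theorem B can be illustrated by simulation. The sketch below uses a model that does not appear in this section and is chosen only because the standard library can sample from it: the exponential density $f(x|\theta) = \theta e^{-\theta x}$, for which the mle is $1/\bar X$ and $I(\theta) = 1/\theta^2$. The true value, sample size, number of replications, and seed are arbitrary assumptions.

```python
import math
import random
import statistics


def standardized_mles(theta0=2.0, n=400, n_rep=2000, seed=1):
    """Simulate sqrt(n I(theta0)) (theta_hat - theta0) for exponential samples."""
    rng = random.Random(seed)
    info = 1.0 / theta0 ** 2  # I(theta0) for f(x|theta) = theta * exp(-theta * x)
    out = []
    for _ in range(n_rep):
        sample = [rng.expovariate(theta0) for _ in range(n)]
        theta_hat = 1.0 / statistics.fmean(sample)  # mle of the rate
        out.append(math.sqrt(n * info) * (theta_hat - theta0))
    return out


z = standardized_mles()
z_mean = statistics.fmean(z)
z_sd = statistics.stdev(z)
print(round(z_mean, 2), round(z_sd, 2))
```

If the theorem's approximation is good at this sample size, the standardized values should have mean near 0 and standard deviation near 1, and a histogram of `z` should look roughly like a standard normal density.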

A corresponding result can be proved for the multidimensional case. The vector
of maximum likelihood estimates is asymptotically normally distributed. The mean
of the asymptotic distribution is the vector of true parameters, $\theta_0$. The covariance
of the estimates $\hat\theta_i$ and $\hat\theta_j$ is given by the $ij$ entry of the matrix
$n^{-1} I^{-1}(\theta_0)$, where $I(\theta)$ is the matrix with $ij$ component

$$E\left[\frac{\partial}{\partial\theta_i} \log f(X|\theta) \, \frac{\partial}{\partial\theta_j} \log f(X|\theta)\right] = -E\left[\frac{\partial^2}{\partial\theta_i \, \partial\theta_j} \log f(X|\theta)\right]$$

Since we do not wish to delve deeply into technical details, we do not specify the
conditions under which the results obtained in this section hold. It is worth mentioning,
however, that the true parameter value, $\theta_0$, is required to be an interior point of the
set of all parameter values. Thus the results would not be expected to apply in Example D of
Section 8.5 if $\alpha_0 = 1$, for example. It is also required that the support of the density
or frequency function $f(x|\theta)$ [the set of values for which $f(x|\theta) > 0$] does not
depend on $\theta$. Thus, for example, the results would not be expected to apply to
estimating $\theta$ from a sample of random variables that were uniformly distributed on the
interval $[0, \theta]$.

The following sections will apply these results in several examples.


Confidence Intervals from Maximum 
Likelihood Estimates 


In Chapter 7, confidence intervals for the population mean $\mu$ were introduced. Recall
that the confidence interval for $\mu$ was a random interval that contained $\mu$ with
some specified probability. In the current context, we are interested in estimating the
parameter $\theta$ of a probability distribution. We will develop confidence intervals for
$\theta$ based on $\hat\theta$; these intervals serve essentially the same function as they
did in Chapter 7 in that they express in a fairly direct way the degree of uncertainty in
the estimate $\hat\theta$. A confidence interval for $\theta$ is an interval based on the
sample values used to estimate $\theta$. Since these sample values are random, the interval
is random and the probability that it contains $\theta$ is called the coverage probability
of the interval. Thus, for example, a 90% confidence interval for $\theta$ is a random
interval that contains $\theta$ with probability .9. A confidence interval quantifies the
uncertainty of a parameter estimate.

We will discuss three methods for forming confidence intervals for maximum 
likelihood estimates: exact methods, approximations based on the large sample prop- 
erties of maximum likelihood estimates, and bootstrap confidence intervals. The con- 
struction of confidence intervals for parameters of a normal distribution illustrates the 
use of exact methods. 


We found in Example B of Section 8.5 that the maximum likelihood estimates of $\mu$
and $\sigma^2$ from an i.i.d. normal sample are

$$\hat\mu = \bar X$$

$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2$$

A confidence interval for $\mu$ is based on the fact that

$$\frac{\sqrt{n}(\bar X - \mu)}{S} \sim t_{n-1}$$

where $t_{n-1}$ denotes the $t$ distribution with $n - 1$ degrees of freedom and

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)^2$$

(see Section 6.3). Let $t_{n-1}(\alpha/2)$ denote that point beyond which the $t$ distribution
with $n - 1$ degrees of freedom has probability $\alpha/2$. Since the $t$ distribution is
symmetric about 0, the probability to the left of $-t_{n-1}(\alpha/2)$ is also $\alpha/2$.
Then, by definition,

$$P\left(-t_{n-1}(\alpha/2) \le \frac{\sqrt{n}(\bar X - \mu)}{S} \le t_{n-1}(\alpha/2)\right) = 1 - \alpha$$

The inequality can be manipulated to yield

$$P\left(\bar X - \frac{S}{\sqrt{n}} t_{n-1}(\alpha/2) \le \mu \le \bar X + \frac{S}{\sqrt{n}} t_{n-1}(\alpha/2)\right) = 1 - \alpha$$

According to this equation, the probability that $\mu$ lies in the interval
$\bar X \pm S \, t_{n-1}(\alpha/2)/\sqrt{n}$ is $1 - \alpha$. Note that this interval is random:
The center is at the random point $\bar X$ and the width is proportional to $S$, which is
also random.

Now let us turn to a confidence interval for $\sigma^2$. From Section 6.3,

$$\frac{n\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-1}$$

where $\chi^2_{n-1}$ denotes the chi-squared distribution with $n - 1$ degrees of freedom. Let
$\chi^2_m(\alpha)$ denote the point beyond which the chi-square distribution with $m$ degrees
of freedom has probability $\alpha$. It then follows by definition that

$$P\left(\chi^2_{n-1}(1 - \alpha/2) \le \frac{n\hat\sigma^2}{\sigma^2} \le \chi^2_{n-1}(\alpha/2)\right) = 1 - \alpha$$

Manipulation of the inequalities yields

$$P\left(\frac{n\hat\sigma^2}{\chi^2_{n-1}(\alpha/2)} \le \sigma^2 \le \frac{n\hat\sigma^2}{\chi^2_{n-1}(1 - \alpha/2)}\right) = 1 - \alpha$$

Therefore, a $100(1 - \alpha)$% confidence interval for $\sigma^2$ is

$$\left(\frac{n\hat\sigma^2}{\chi^2_{n-1}(\alpha/2)}, \; \frac{n\hat\sigma^2}{\chi^2_{n-1}(1 - \alpha/2)}\right)$$

Note that this interval is not symmetric about $\hat\sigma^2$; it is not of the form
$\hat\sigma^2 \pm c$, unlike the previous example.


A simulation illustrates these ideas: The following experiment was done on a
computer 20 times. A random sample of size $n = 11$ from a normal distribution with
mean $\mu = 10$ and variance $\sigma^2 = 9$ was generated. From the sample, $\bar X$ and
$\hat\sigma^2$ were calculated and 90% confidence intervals for $\mu$ and $\sigma^2$ were
constructed, as described before. Thus at the end there were 20 intervals for $\mu$ and 20
intervals for $\sigma^2$. The 20 intervals for $\mu$ are shown as vertical lines in the left
panel of Figure 8.8 and the 20 intervals for $\sigma^2$ are shown in the right panel.
Horizontal lines are drawn at the true values $\mu = 10$ and $\sigma^2 = 9$. Since these are
90% confidence intervals, we expect the true parameter values to fall outside the intervals
10% of the time; thus on the average we would expect 2 of 20 intervals to fail to cover the
true parameter value. From the figure, we see that all the intervals for $\mu$ actually
cover $\mu$, whereas four of the intervals for $\sigma^2$ failed to contain $\sigma^2$. ■

FIGURE 8.8 20 confidence intervals for $\mu$ (left panel) and for $\sigma^2$ (right panel) as
described in Example A. Horizontal lines indicate the true values.
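The experiment can be repeated many more than 20 times to see the long-run coverage. The following is a sketch under stated assumptions: the quantiles $t_{10}(.05) = 1.812$, $\chi^2_{10}(.05) = 18.307$, and $\chi^2_{10}(.95) = 3.940$ are standard table values entered by hand (the Python standard library has no $t$ or chi-square quantile function), and the number of replications and the seed are arbitrary.

```python
import math
import random
import statistics

# Table values for 10 degrees of freedom (n = 11) and alpha = .10.
T_10_05 = 1.812       # t_10(.05)
CHI2_10_05 = 18.307   # chi^2_10(.05)
CHI2_10_95 = 3.940    # chi^2_10(.95)


def intervals(sample):
    """90% intervals for mu and sigma^2 from a normal sample of size 11."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    ss = sum((x - xbar) ** 2 for x in sample)  # equals n * sigma_hat^2
    s = math.sqrt(ss / (n - 1))
    half = s * T_10_05 / math.sqrt(n)
    return (xbar - half, xbar + half), (ss / CHI2_10_05, ss / CHI2_10_95)


def coverage(n_rep=4000, mu=10.0, sigma2=9.0, n=11, seed=2):
    """Fraction of simulated intervals that contain the true mu and sigma^2."""
    rng = random.Random(seed)
    hit_mu = hit_var = 0
    for _ in range(n_rep):
        sample = [rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n)]
        (lo_m, hi_m), (lo_v, hi_v) = intervals(sample)
        hit_mu += lo_m <= mu <= hi_m
        hit_var += lo_v <= sigma2 <= hi_v
    return hit_mu / n_rep, hit_var / n_rep


cov_mu, cov_var = coverage()
print(cov_mu, cov_var)
```

Both fractions should be close to .90; with only 20 repetitions, as in Figure 8.8, counts such as 0 or 4 misses are not surprising.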


Exact methods such as that illustrated in the previous example are the exception
rather than the rule in practice. To construct an exact interval requires detailed
knowledge of the sampling distribution as well as some cleverness. A second method of
constructing confidence intervals is based on the large sample theory of the previous
section. According to the results of that section, the distribution of
$\sqrt{nI(\theta_0)}(\hat\theta - \theta_0)$ is approximately the standard normal
distribution. Since $\theta_0$ is unknown, we will use $I(\hat\theta)$ in place of
$I(\theta_0)$; we have employed similar substitutions a number of times before, for example,
in finding an approximate standard error in Example A of Section 8.4. It can be further
argued that the distribution of $\sqrt{nI(\hat\theta)}(\hat\theta - \theta_0)$ is also
approximately standard normal. Since the standard normal distribution is symmetric
about 0,

$$P\left(-z(\alpha/2) \le \sqrt{nI(\hat\theta)}(\hat\theta - \theta_0) \le z(\alpha/2)\right) \approx 1 - \alpha$$

Manipulation of the inequalities yields

$$\hat\theta \pm z(\alpha/2) \frac{1}{\sqrt{nI(\hat\theta)}}$$

as an approximate $100(1 - \alpha)$% confidence interval. We now illustrate this procedure
with an example.


Poisson Distribution
The mle of $\lambda$ from a sample of size $n$ from a Poisson distribution is

$$\hat\lambda = \bar X$$

Since the sum of independent Poisson random variables follows a Poisson distribution,
the parameter of which is the sum of the parameters of the individual summands,
$n\hat\lambda = \sum_{i=1}^{n} X_i$ follows a Poisson distribution with mean $n\lambda$. Also,
the sampling distribution of $\hat\lambda$ is known, although it depends on the true value of
$\lambda$, which is unknown. Exact confidence intervals for $\lambda$ may be obtained by using
this fact, and special tables are available (Pearson and Hartley 1966).

For large samples, confidence intervals may be derived as follows. First, we
need to calculate $I(\lambda)$. Let $f(x|\lambda)$ denote the probability mass function of a
Poisson random variable with parameter $\lambda$. There are two ways to do this. We may use
the definition

$$I(\lambda) = E\left[\frac{\partial}{\partial\lambda} \log f(X|\lambda)\right]^2$$

We know that

$$\log f(x|\lambda) = x \log \lambda - \lambda - \log x!$$

and thus

$$I(\lambda) = E\left(\frac{X}{\lambda} - 1\right)^2$$

Rather than evaluate this quantity, we may use the alternative expression for $I(\lambda)$
given by Lemma A of Section 8.5.2:

$$I(\lambda) = -E\left[\frac{\partial^2}{\partial\lambda^2} \log f(X|\lambda)\right]$$

Since

$$\frac{\partial^2}{\partial\lambda^2} \log f(X|\lambda) = -\frac{X}{\lambda^2}$$

$I(\lambda)$ is simply

$$\frac{E(X)}{\lambda^2} = \frac{1}{\lambda}$$

Thus, an approximate $100(1 - \alpha)$% confidence interval for $\lambda$ is

$$\bar X \pm z(\alpha/2) \sqrt{\frac{\bar X}{n}}$$

Note that the asymptotic variance is in fact the exact variance in this case. The
confidence interval, however, is only approximate, since the sampling distribution of
$\bar X$ is only approximately normal.
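The two expressions for $I(\lambda)$ can be compared numerically. The sketch below sums both expectations over the Poisson probability mass function, truncating the sum at an arbitrary cutoff that is far out in the tail for the value of $\lambda$ used; the function names and the cutoff are our own choices.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability mass function, computed on the log scale."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def information_two_ways(lam, x_max=400):
    """Return E[(X/lam - 1)^2] and -E[d^2/dlam^2 log f] = E[X / lam^2]."""
    score_sq = 0.0
    neg_second = 0.0
    for x in range(x_max + 1):
        p = poisson_pmf(x, lam)
        score_sq += (x / lam - 1) ** 2 * p
        neg_second += (x / lam ** 2) * p
    return score_sq, neg_second


a, b = information_two_ways(24.9)
print(a, b, 1 / 24.9)
```

Both sums should agree with $1/\lambda$ up to rounding error, as Lemma A asserts.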





As a concrete example, let us return to the study that involved counting asbestos
fibers on filters, discussed earlier. In Example A in Section 8.4, we found
$\hat\lambda = 24.9$. The estimated standard error of $\hat\lambda$ is thus ($n = 23$)

$$s_{\hat\lambda} = \sqrt{\frac{\hat\lambda}{n}} = 1.04$$

An approximate 90% confidence interval for $\lambda$ is

$$\hat\lambda \pm 1.65 \, s_{\hat\lambda}$$

or (23.2, 26.6). This interval gives a good indication of the uncertainty inherent in
the determination of the average asbestos level using the model that the counts in the
grid squares are independent Poisson random variables. ■
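The arithmetic of this interval is easy to reproduce. A minimal sketch follows, using `statistics.NormalDist` from the Python standard library for the normal quantile (the text rounds it to 1.65); the function name is our own.

```python
import math
from statistics import NormalDist


def poisson_approx_ci(lam_hat, n, level=0.90):
    """Large-sample interval lam_hat +/- z * sqrt(lam_hat / n) for a Poisson mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = math.sqrt(lam_hat / n)
    return se, (lam_hat - z * se, lam_hat + z * se)


se, (lo, hi) = poisson_approx_ci(24.9, 23)
print(round(se, 2), round(lo, 1), round(hi, 1))
```

Changing `level` gives intervals with other nominal coverage probabilities from the same estimate and standard error.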


In a similar way, approximate confidence intervals can be obtained for parameters
estimated from random multinomial counts. The counts are not i.i.d., so the variance
of the parameter estimate is not of the form $1/[nI(\theta)]$. However, it can be shown that

$$\text{Var}(\hat\theta) \approx \frac{1}{E[l'(\theta_0)^2]} = -\frac{1}{E[l''(\theta_0)]}$$

and the maximum likelihood estimate is approximately normally distributed. Example C
illustrates this concept.





Hardy-Weinberg Equilibrium
Let us return to the example of Hardy-Weinberg equilibrium discussed in Example A
in Section 8.5.1. There we found $\hat\theta = .4247$. Now,

$$l'(\theta) = -\frac{2X_1 + X_2}{1 - \theta} + \frac{2X_3 + X_2}{\theta}$$

In order to calculate $E[l'(\theta)^2]$, we would have to deal with the variances and
covariances of the $X_i$. This does not look too inviting; it turns out to be easier to
calculate $E[l''(\theta)]$.

$$l''(\theta) = -\frac{2X_1 + X_2}{(1 - \theta)^2} - \frac{2X_3 + X_2}{\theta^2}$$

Since the $X_i$ are binomially distributed, we have

$$E(X_1) = n(1 - \theta)^2$$

$$E(X_2) = 2n\theta(1 - \theta)$$

$$E(X_3) = n\theta^2$$

We find, after some algebra, that

$$E[l''(\theta)] = -\frac{2n}{\theta(1 - \theta)}$$

Since $\theta$ is unknown, we substitute $\hat\theta$ in its place and obtain the estimated
standard error of $\hat\theta$:

$$s_{\hat\theta} = \frac{1}{\sqrt{-E[l''(\hat\theta)]}} = \sqrt{\frac{\hat\theta(1 - \hat\theta)}{2n}} = .011$$

An approximate 95% confidence interval for $\theta$ is $\hat\theta \pm 1.96 \, s_{\hat\theta}$,
or (.403, .447). (Note that this estimated standard error of $\hat\theta$ agrees with that
obtained by the bootstrap in Example 8.5.1A.) ■
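The "some algebra" and the resulting standard error can be checked numerically. In the sketch below (function and variable names are our own), the expected second derivative is computed directly from the expected cell counts and compared with the closed form $-2n/[\theta(1 - \theta)]$. The interval is computed from unrounded values, so its endpoints may differ from the quoted ones in the third decimal place.

```python
import math
from statistics import NormalDist

x1, x2, x3 = 342, 500, 187
n = x1 + x2 + x3
theta_hat = (2 * x3 + x2) / (2 * n)


def expected_second_derivative(theta, n):
    """E[l''(theta)] obtained by plugging the expected counts into l''(theta)."""
    e1 = n * (1 - theta) ** 2
    e2 = 2 * n * theta * (1 - theta)
    e3 = n * theta ** 2
    return -(2 * e1 + e2) / (1 - theta) ** 2 - (2 * e3 + e2) / theta ** 2


e_l2 = expected_second_derivative(theta_hat, n)
closed_form = -2 * n / (theta_hat * (1 - theta_hat))

se = 1 / math.sqrt(-e_l2)            # estimated standard error of theta_hat
z = NormalDist().inv_cdf(0.975)      # about 1.96
ci = (theta_hat - z * se, theta_hat + z * se)
print(round(se, 4), ci)
```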


Finally, we describe the use of the bootstrap for finding approximate confidence
intervals. Suppose that $\hat\theta$ is an estimate of a parameter $\theta$, the true, unknown
value of which is $\theta_0$, and suppose for the moment that the distribution of
$\Delta = \hat\theta - \theta_0$ is known. Denote the $\alpha/2$ and $1 - \alpha/2$ quantiles
of this distribution by $\underline{\delta}$ and $\bar\delta$; i.e.,

$$P(\hat\theta - \theta_0 \le \underline{\delta}) = \frac{\alpha}{2}$$

$$P(\hat\theta - \theta_0 \le \bar\delta) = 1 - \frac{\alpha}{2}$$

Then

$$P(\underline{\delta} \le \hat\theta - \theta_0 \le \bar\delta) = 1 - \alpha$$

and from manipulation of the inequalities,

$$P(\hat\theta - \bar\delta \le \theta_0 \le \hat\theta - \underline{\delta}) = 1 - \alpha$$

The preceding assumed that the distribution of $\hat\theta - \theta_0$ was known, which is
typically not the case. If $\theta_0$ were known, this distribution could be approximated
arbitrarily well by simulation: Many, many samples of observations could be randomly
generated on a computer with the true value $\theta_0$; for each sample, the difference
$\hat\theta - \theta_0$ could be recorded; and the two quantiles $\underline{\delta}$ and
$\bar\delta$ could, consequently, be determined as accurately as desired. Since $\theta_0$ is
not known, the bootstrap principle suggests using $\hat\theta$ in its place: Generate many,
many samples (say, $B$ in all) from a distribution with value $\hat\theta$; and for each
sample construct an estimate of $\theta$, say $\theta_j^*$, $j = 1, 2, \ldots, B$. The
distribution of $\hat\theta - \theta_0$ is then approximated by that of
$\theta^* - \hat\theta$, the quantiles of which are used to form an approximate confidence
interval. Examples may make this clearer.
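The recipe can be written as a short function. This is a sketch with our own naming and one particular convention for sample quantiles (the $k$th value in increasing order, with $k = B\alpha/2$ and $k = B(1 - \alpha/2)$ rounded to integers); other quantile conventions would give slightly different endpoints.

```python
def bootstrap_interval(theta_hat, boot_estimates, alpha=0.05):
    """Interval (theta_hat - delta_upper, theta_hat - delta_lower) from bootstrap values."""
    ordered = sorted(boot_estimates)
    b = len(ordered)
    # k-th value in increasing order, clamped to valid positions.
    lo_q = ordered[max(round(b * alpha / 2) - 1, 0)]
    hi_q = ordered[min(round(b * (1 - alpha / 2)) - 1, b - 1)]
    delta_lower = lo_q - theta_hat   # estimates the alpha/2 quantile of theta* - theta_hat
    delta_upper = hi_q - theta_hat   # estimates the 1 - alpha/2 quantile
    return theta_hat - delta_upper, theta_hat - delta_lower


# Artificial bootstrap values, used only to show the mechanics.
fake_boot = [i / 1000 for i in range(1, 1001)]
print(bootstrap_interval(0.4, fake_boot, alpha=0.05))
```

Note that the upper quantile of the differences determines the lower endpoint of the interval and vice versa, exactly as in the manipulation of the inequalities.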


We first apply this technique to the Hardy-Weinberg equilibrium problem; we will
find an approximate 95% confidence interval based on the bootstrap and compare the
result to the interval obtained in Example C, where large-sample theory for maximum
likelihood estimates was used. The 1000 bootstrap estimates of $\theta$ of Example A
of Section 8.5.1 provide an estimate of the distribution of $\theta^*$; in particular the 25th
largest is .403 and the 975th largest is .446, which are our estimates of the .025 and
.975 quantiles of the distribution. The distribution of $\theta^* - \hat\theta$ is
approximated by subtracting $\hat\theta = .425$ from each $\theta^*$, so the .025 and .975
quantiles of this distribution are estimated as

$$\underline{\delta} = .403 - .425 = -.022$$

$$\bar\delta = .446 - .425 = .021$$

Thus our approximate 95% confidence interval is

$$(\hat\theta - \bar\delta, \; \hat\theta - \underline{\delta}) = (.404, .447)$$

Since the uncertainty in $\hat\theta$ is in the second decimal place, this interval and that
found in Example C are identical for all practical purposes. ■


Finally, we apply the bootstrap to find approximate confidence intervals for the
parameters of the gamma distribution fit in Example C of Section 8.5. Recall that
the estimates were $\hat\alpha = .471$ and $\hat\lambda = 1.97$. Of the 1000 bootstrap values
of $\alpha^*$, $\alpha_1^*, \alpha_2^*, \ldots, \alpha_{1000}^*$, the 50th largest was .419 and
the 950th largest was .538; the .05 and .95 quantiles of the distribution of
$\alpha^* - \hat\alpha$ are approximated by subtracting $\hat\alpha$ from these values, giving

$$\underline{\delta} = .419 - .471 = -.052$$

$$\bar\delta = .538 - .471 = .067$$

Our approximate 90% confidence interval for $\alpha_0$ is thus

$$(\hat\alpha - \bar\delta, \; \hat\alpha - \underline{\delta}) = (.404, .523)$$

The 50th and 950th largest values of $\lambda^*$ were 1.619 and 2.478, and the corresponding
approximate 90% confidence interval for $\lambda_0$ is (1.462, 2.321). ■


We caution the reader that there are a number of different methods of using the 
bootstrap to find approximate confidence intervals. We have chosen to present the 
preceding method largely because the reasoning leading to its development is fairly 
direct. Another popular method, the bootstrap percentile method, uses the quantiles 
of the bootstrap distribution of $\hat\theta$ directly. Using this method in the previous
example, the confidence interval for $\alpha$ would be (.419, .538). Although this direct equation
of quantiles of the bootstrap sampling distribution with confidence limits may seem 
initially appealing, its rationale is somewhat obscure. If the bootstrap distribution is 
symmetric, the two methods are equivalent (see Problem 38). 
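The two constructions are easy to compare side by side. The sketch below takes the bootstrap quantiles quoted in the gamma example as given inputs; the function names are our own.

```python
def basic_interval(estimate, q_low, q_high):
    """Interval built from quantiles of (bootstrap value - estimate)."""
    return (estimate - (q_high - estimate), estimate - (q_low - estimate))


def percentile_interval(q_low, q_high):
    """Bootstrap percentile method: use the bootstrap quantiles directly."""
    return (q_low, q_high)


# Quantiles quoted in the gamma example (alpha_hat = .471, lambda_hat = 1.97).
alpha_basic = basic_interval(0.471, 0.419, 0.538)
alpha_pct = percentile_interval(0.419, 0.538)
lambda_basic = basic_interval(1.97, 1.619, 2.478)

# When the quantiles are symmetric about the estimate, the two methods agree.
symmetric_basic = basic_interval(0.5, 0.4, 0.6)

print(alpha_basic, alpha_pct, lambda_basic, symmetric_basic)
```

The first method reflects the bootstrap quantiles about the estimate, so the two intervals differ exactly to the extent that the bootstrap distribution is asymmetric about $\hat\theta$.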


The Bayesian Approach 
to Parameter Estimation 


A preview of the Bayesian approach was given in Example E of Section 3.5.2, which 
should be reviewed before continuing.

In the Bayesian approach, the unknown parameter $\theta$ is treated as a random variable,
with "prior distribution" $f_\Theta(\theta)$ representing what we know about the parameter
before observing data, $X$. In the following, we assume $\Theta$ is a continuous random
variable; the discrete case is entirely analogous. This model is in contrast to the
approaches described in the previous sections, in which $\theta$ was treated as an unknown
constant. For a given value, $\Theta = \theta$, the data have the probability distribution
(density or probability mass function) $f_{X|\Theta}(x|\theta)$. The joint distribution of
$X$ and $\Theta$ is thus

$$f_{X,\Theta}(x, \theta) = f_{X|\Theta}(x|\theta) \, f_\Theta(\theta)$$

and the marginal distribution of $X$ is

$$
\begin{aligned}
f_X(x) &= \int f_{X,\Theta}(x, \theta) \, d\theta \\
&= \int f_{X|\Theta}(x|\theta) \, f_\Theta(\theta) \, d\theta
\end{aligned}
$$

The distribution of $\Theta$ given the data $X$ is thus

$$
\begin{aligned}
f_{\Theta|X}(\theta|x) &= \frac{f_{X,\Theta}(x, \theta)}{f_X(x)} \\
&= \frac{f_{X|\Theta}(x|\theta) \, f_\Theta(\theta)}{\int f_{X|\Theta}(x|\theta) \, f_\Theta(\theta) \, d\theta}
\end{aligned}
$$

This is called the posterior distribution; it represents what is known about $\Theta$ having
observed data $X$. Note that the likelihood is $f_{X|\Theta}(x|\theta)$, viewed as a function
of $\theta$, and we may usefully summarize the preceding result as

$$f_{\Theta|X}(\theta|x) \propto f_{X|\Theta}(x|\theta) \times f_\Theta(\theta)$$

Posterior density $\propto$ Likelihood $\times$ Prior density


The Bayes paradigm has an appealing formal simplicity as it involves elementary 
probability operations. We will now see what it amounts to for examples we considered 
earlier. 


Fitting a Poisson Distribution
Here the unknown parameter is $\lambda$, which has a prior distribution $f_\Lambda(\lambda)$,
and the data are $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$, which for a given value
$\lambda$ are Poisson random variables with

$$f_{X_i|\Lambda}(x_i|\lambda) = \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}, \qquad x_i = 0, 1, 2, \ldots$$

Their joint distribution given $\lambda$ is (from independence) the product of their marginal
distributions given $\lambda$:

$$f_{X|\Lambda}(x|\lambda) = \frac{\lambda^{\sum_{i=1}^{n} x_i} e^{-n\lambda}}{\prod_{i=1}^{n} x_i!}$$

where $X$ denotes $(X_1, X_2, \ldots, X_n)$. The posterior distribution of $\Lambda$ given $X$
is then

$$f_{\Lambda|X}(\lambda|x) = \frac{\lambda^{\sum_{i=1}^{n} x_i} e^{-n\lambda} f_\Lambda(\lambda)}{\int \lambda^{\sum_{i=1}^{n} x_i} e^{-n\lambda} f_\Lambda(\lambda) \, d\lambda}$$

(the term $\prod_{i=1}^{n} x_i!$ has cancelled out).

Thus, to evaluate the posterior distribution, we have to do two things: specify
the prior distribution $f_\Lambda(\lambda)$ and carry out the integration in the denominator
of the preceding expression. For illustration, we consider the data of Examples 8.4A
and 8.5A.

We will consider two approaches to specifying the prior distribution. The first
is that of an orthodox Bayesian who takes very seriously the model that the prior
distribution specifies his prior opinion. Note that this specification should be done
before seeing the data, $X$, and he is required to provide the probability density
$f_\Lambda(\lambda)$ through introspection. This is not an easy task to carry out, and even
the orthodox often compromise for convenience. He thus decides to quantify his opinion by
specifying a prior mean $\mu_1 = 15$ and standard deviation $\sigma = 5$ and to use, because
the math works out nicely as we will see, a gamma density with that mean and standard
deviation. This choice could be aided by plotting gamma densities for various parameter
values. The prior density is shown in Figure 8.9. Using the relationships developed in
Example C in Section 8.4, the second moment is $\mu_2 = \mu_1^2 + \sigma^2 = 250$ and the
parameters of the gamma density are

$$\nu = \frac{\mu_1}{\mu_2 - \mu_1^2} = 0.6$$

$$\alpha = \nu \mu_1 = 9$$

FIGURE 8.9 First statistician's prior (solid) and posterior (dashed). Second
statistician's posterior (dotted).

(We denote the parameter by $\nu$ rather than by the usual $\lambda$ since $\lambda$ has
already been used for the parameter of the Poisson distribution.) The prior distribution for
$\Lambda$ is then

$$f_\Lambda(\lambda) = \frac{\nu^\alpha \lambda^{\alpha - 1} e^{-\nu\lambda}}{\Gamma(\alpha)}$$

After some cancellation, the posterior density is

$$f_{\Lambda|X}(\lambda|x) = \frac{\lambda^{\sum x_i + \alpha - 1} e^{-(n + \nu)\lambda}}{\int_0^\infty \lambda^{\sum x_i + \alpha - 1} e^{-(n + \nu)\lambda} \, d\lambda}$$

Now, consider an important trick that is used time and again in Bayesian calculations:
the denominator is a constant that makes the expression integrate to 1. We can
deduce from the form of the numerator that the ratio must be a gamma density with
parameters

$$\alpha' = \sum x_i + \alpha = 582$$

$$\nu' = n + \nu = 23.6$$

This standard trick allows the statistician to avoid having to do any explicit
integration. (Make sure you understand it, because it will occur again several times.) The
posterior density is shown in Figure 8.9. Compare it to the prior distribution to observe
how observation of the data, $X$, has drastically changed his state of knowledge about
$\Lambda$. Notice that the posterior density is much more symmetric and looks like a normal
density (that this is no accident will be shown later). ■


According to the Bayesian paradigm, all the information about $\Lambda$ is contained in
the posterior distribution. The mean of this distribution (the posterior mean) is

$$\mu_{post} = \frac{\alpha'}{\nu'} = 24.7$$

The most probable value of $\Lambda$, the posterior mode, is 24.6. (Verify that the gamma
density is maximized at $(\alpha - 1)/\nu$.) Either of these two values could be used as a
point estimate of the unknown mean of the Poisson distribution, if a single number is
required.
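The first statistician's calculation involves no integration at all, as the following sketch shows. One assumption is ours: the data total $\sum x_i = 573$ is not stated in this section and is inferred from $\alpha' = 582$ and $\alpha = 9$ (it is also consistent with $n\hat\lambda = 23 \times 24.9$). The function names are hypothetical, and the standard deviation uses the standard gamma formula $\sqrt{\alpha'}/\nu'$.

```python
import math


def gamma_from_mean_sd(mean, sd):
    """Gamma parameters (alpha, nu) with the given mean alpha/nu and sd sqrt(alpha)/nu."""
    nu = mean / sd ** 2
    alpha = mean * nu
    return alpha, nu


def poisson_gamma_update(alpha, nu, sum_x, n):
    """Posterior gamma parameters after n Poisson observations with total sum_x."""
    return alpha + sum_x, nu + n


alpha0, nu0 = gamma_from_mean_sd(15.0, 5.0)   # prior: alpha = 9, nu = 0.6
SUM_X = 573                                   # inferred from alpha' = 582, see lead-in
alpha1, nu1 = poisson_gamma_update(alpha0, nu0, SUM_X, 23)

post_mean = alpha1 / nu1
post_mode = (alpha1 - 1) / nu1
post_sd = math.sqrt(alpha1) / nu1
print(alpha1, nu1, round(post_mean, 1), round(post_mode, 1), round(post_sd, 2))
```

Trying a different prior mean and standard deviation in `gamma_from_mean_sd` shows directly how strongly, or weakly, the prior opinion moves the posterior summaries.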
The variance of the posterior distribution is

$$\sigma^2_{post} = \frac{\alpha'}{\nu'^2} = 1.04$$

and the posterior standard deviation is $\sigma_{post} = 1.02$, which is a simple measure of
variability: the posterior distribution of $\Lambda$ has mean 24.7 and standard deviation
1.02. A Bayesian analogue of a 90% confidence interval is the interval from the 5th
percentile to the 95th percentile of the posterior, which can be found numerically to
be [23.02, 26.34]. A common alternative to this interval is a high posterior density
(HPD) interval, formed as follows: Imagine placing a horizontal line at the high point
of the posterior density and moving it downward until the interval of $\lambda$ formed below
where the line cuts the density contained 90% probability. If the posterior density is
symmetric and unimodal, as is nearly the case in Figure 8.9, the HPD interval will
coincide with the interval between the percentiles.

The second statistician takes a more utilitarian, noncommittal approach. She
believes that it is implausible that the mean count $\lambda$ could be larger than 100, and
uses a simple prior that is uniform on [0, 100], without trying to quantify her opinion
more precisely. The posterior density is thus

$$f_{\Lambda|X}(\lambda|x) = \frac{\frac{1}{100} \lambda^{\sum x_i} e^{-n\lambda}}{\frac{1}{100} \int_0^{100} \lambda^{\sum x_i} e^{-n\lambda} \, d\lambda}, \qquad 0 \le \lambda \le 100$$





The denominator has to be integrated numerically, but this is easy to do for such 
a smooth function. The resulting posterior density is shown in Figure 8.9. Using 
numerical evaluations, she finds that the posterior mode is 24.9, the posterior mean 
is 25.0, and the posterior standard deviation is 1.04. The interval from the 5th to the 
95th percentile is [23.3, 26.7]. 
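One way to carry out the numerical evaluation is on a fine grid. The sketch below is not the computation used for Figure 8.9; it assumes $\sum x_i = 573$ (inferred from the first statistician's $\alpha' = 582$ with $\alpha = 9$), a grid of 100,000 points, and a simple "first grid point whose cumulative probability reaches $q$" rule for percentiles.

```python
import math


def grid_posterior(sum_x, n, upper=100.0, n_points=100_000):
    """Posterior on a grid for a Poisson mean with a uniform prior on [0, upper]."""
    step = upper / n_points
    grid = [step * k for k in range(1, n_points + 1)]
    # Work with logs: lambda ** sum_x would overflow a float.
    log_kernel = [sum_x * math.log(lam) - n * lam for lam in grid]
    peak = max(log_kernel)
    weights = [math.exp(v - peak) for v in log_kernel]
    total = sum(weights)  # plays the role of the integral in the denominator
    return grid, [w / total for w in weights]


def percentile(grid, probs, q):
    """Smallest grid point whose cumulative probability reaches q."""
    running = 0.0
    for lam, p in zip(grid, probs):
        running += p
        if running >= q:
            return lam
    return grid[-1]


SUM_X, N = 573, 23  # SUM_X is an inferred value, see lead-in
grid, probs = grid_posterior(SUM_X, N)
post_mode = grid[probs.index(max(probs))]
post_mean = sum(lam * p for lam, p in zip(grid, probs))
post_sd = math.sqrt(sum((lam - post_mean) ** 2 * p for lam, p in zip(grid, probs)))
p05 = percentile(grid, probs, 0.05)
p95 = percentile(grid, probs, 0.95)
print(post_mode, round(post_mean, 2), round(post_sd, 2), p05, p95)
```

Because the prior is flat, the normalized weights are just the likelihood rescaled to sum to 1, which is why the grid mode falls at the maximum likelihood estimate.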

We now compare these two results to each other and to the results of maximum 
likelihood analysis. 








Estimate              Bayes 1    Bayes 2    Maximum Likelihood
mode                  24.6       24.9       24.9
mean                  24.7       25.0       -
standard deviation    1.02       1.04       1.04
upper limit           26.3       26.7       26.6
lower limit           23.0       23.3       23.2





Comparing the results of the second Bayesian to those of maximum likelihood, 
it is important to realize that her posterior density is directly proportional to the
likelihood for $0 \le \lambda \le 100$, because the prior is flat over this range and the posterior is
proportional to the prior times the likelihood. Thus, her posterior mode and the max- 
imum likelihood estimate are identical. There is no such guarantee that her posterior 
standard deviation and the approximate standard error of the maximum likelihood 
estimate are identical, but they turn out to be, to the number of significant figures 
displayed in the table. The two 90% intervals are very close. 

Now compare the results of the first and second Bayesians. Observe that although
his prior opinion was not in accord with the data, the data strongly modified the prior,
to produce a posterior that is close to hers. Even though they start with quite different
assumptions, the data force them to very similar conclusions. His prior opinion has
indeed influenced the results: his posterior mean and mode are less than hers, but
the influence has been mild. (If there had been less data or if his prior opinions
had been much more biased to low values, the results would have been in greater
conflict.) The fundamental result that the posterior is proportional to the prior times
the likelihood helps us to understand the difference: the likelihood is substantial only
in the region approximately between $\lambda = 22$ and $\lambda = 28$. (This can be seen in the
figure, because the second statistician's posterior is proportional to the likelihood.
See Figure 8.5, also.) In this region, his prior decreases slowly, so the posterior is
proportional to a weighted version of the likelihood, with slowly decreasing weight.
The first Bayesian's posterior thus differs from the second by being pushed up slightly
on the left and pulled down on the right.

Although they are very similar numerically, there is an important difference
between the Bayesian and frequentist interpretation of the confidence intervals. In
the Bayesian framework, $\Lambda$ is a random variable and it makes perfect sense to say,
"Given the observations, the probability that $\Lambda$ is in the interval [23.3, 26.7] is 0.90."
Under the frequentist framework, such a statement makes no sense, because $\lambda$ is a
constant, albeit unknown, and it either lies in the interval [23.3, 26.7] or doesn't; no
probability is involved. Before the data are observed, the interval is random, and it
makes sense to state that the probability that the interval contains the true parameter
value is 0.90, but after the data are observed, nothing is random anymore. One way to
understand the difference of interpretation is to realize that in the Bayesian analysis
the interval refers to the state of knowledge about $\lambda$ and not to $\lambda$ itself.

Finally, we note that an alternative for the second statistician would have been to 
use a gamma prior because of its analytical convenience, but to make the prior very 
flat. This can be accomplished by setting $\alpha$ and $\nu$ to be very small.


Normal Distribution
It is convenient to reparametrize the normal distribution, replacing $\sigma^2$ by
$\xi = 1/\sigma^2$; $\xi$ is called the precision. We will also use $\theta$ in place of
$\mu$. The density is then

$$f(x|\theta, \xi) = \left(\frac{\xi}{2\pi}\right)^{1/2} \exp\left(-\frac{1}{2} \xi (x - \theta)^2\right)$$

The normal distribution has two parameters, and we will consider cases of Bayesian
analysis depending on which of them are known and unknown. ■


Case of Unknown Mean and Known Variance 

We first consider the case in which the precision is known, € = &ọ and the mean, 0, 
is unknown. In the Bayesian treatment, the mean is a random variable, ©. It is mathe- 
matically convenient to use a prior distribution for ©, which is N (6, Ex This prior 
is very flat, or uninformative, when Ẹprior is very small, i.e., when the prior variance 
is very large. Thus, if X = (X4, X2, ..., Xn) are independent given 0 


feix(0|x) x fxio (x10) x fo(@) 
_ ( ^ pres =i ql gy x prior < 
"Ux HPP Qn 
x exp (He — a) 


1 n 
« exp (-; la Gi = oy. + Epio (0 = i] 
i=l 


Here we have exhibited only the terms in the posterior density that depend upon $\theta$;
the last expression above shows the shape of the posterior density as a function of $\theta$.
The posterior density itself is proportional to this expression, with a proportionality
constant that is determined by the requirement that the posterior density integrates to 1.

We will now manipulate the expression for the numerator to cast it in a form so
that we can recognize that the posterior density is normal. Expressing
$\sum (x_i - \theta)^2 = \sum (x_i - \bar{x})^2 + n(\theta - \bar{x})^2$, and absorbing more terms
that do not depend on $\theta$ into the constant of proportionality (a typical move in
Bayesian calculations), we find

$$f_{\Theta|X}(\theta \mid x) \propto \exp\left(-\frac{1}{2}\left[n\xi_0 (\theta - \bar{x})^2 + \xi_{\text{prior}} (\theta - \theta_0)^2\right]\right)$$


Now, observe that this is of the form $\exp(-(1/2)\,Q(\theta))$, where $Q(\theta)$ is a quadratic
polynomial. We can find expressions $\theta_{\text{post}}$ and $\xi_{\text{post}}$, and write

$$Q(\theta) = \xi_{\text{post}} (\theta - \theta_{\text{post}})^2 + \text{terms that do not depend on } \theta$$

and conclude that the posterior density is normal with posterior mean $\theta_{\text{post}}$ and
posterior precision $\xi_{\text{post}}$. Again, terms that do not depend on $\theta$ do not affect the
shape of the posterior density and are absorbed in the normalization constant that makes the
posterior density integrate to 1. Thus we expand $Q(\theta)$ and identify the coefficient of
$\theta^2$ as the posterior precision and the coefficient of $-\theta$ as twice the posterior mean
times the posterior precision. Doing so, we find


$$\xi_{\text{post}} = n\xi_0 + \xi_{\text{prior}}$$

$$\theta_{\text{post}} = \frac{n\xi_0 \bar{x} + \theta_0 \xi_{\text{prior}}}{n\xi_0 + \xi_{\text{prior}}}
= \bar{x}\,\frac{n\xi_0}{n\xi_0 + \xi_{\text{prior}}} + \theta_0\,\frac{\xi_{\text{prior}}}{n\xi_0 + \xi_{\text{prior}}}$$

The posterior density of $\theta$ is thus normal with this mean and precision. Note that the
precision has increased and that the posterior mean is a weighted combination of the
sample mean and the prior mean.

To interpret these results, consider what happens when $\xi_{\text{prior}} \ll n\xi_0$, which would
be the case if $n$ were sufficiently large or if $\xi_{\text{prior}}$ were small (as for a very flat
prior). Then the posterior mean would be

$$\theta_{\text{post}} \approx \bar{x}$$

which is the maximum likelihood estimate, and

$$\xi_{\text{post}} \approx n\xi_0$$


This last equation can be written as $\sigma^2_{\text{post}} \approx \sigma_0^2/n$, which is just the variance of $\bar{X}$ in
the non-Bayesian setting. In summary, if the flat prior with very small $\xi_{\text{prior}}$ is used,
the posterior density of $\theta$ is very close to normal with mean $\bar{x}$ and variance $\sigma_0^2/n$.
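
The update formulas for the posterior mean and precision can be checked numerically. The following Python sketch is only an illustration: the function name `normal_posterior` and the sample values are assumptions made here, not part of the text.

```python
def normal_posterior(xs, xi0, theta0, xi_prior):
    """Posterior mean and precision of a normal mean when the precision xi0 is known.

    Prior: Theta ~ N(theta0, 1 / xi_prior). All names are illustrative.
    """
    n = len(xs)
    xbar = sum(xs) / n
    xi_post = n * xi0 + xi_prior
    theta_post = (n * xi0 * xbar + xi_prior * theta0) / xi_post
    return theta_post, xi_post


# Hypothetical data; with a nearly flat prior (tiny xi_prior) the posterior
# mean is essentially the sample mean and the precision is about n * xi0.
data = [4.9, 5.3, 5.1, 4.7, 5.0]
theta_post, xi_post = normal_posterior(data, xi0=4.0, theta0=0.0, xi_prior=1e-6)
print(theta_post, xi_post)
```

Increasing `xi_prior` in this sketch pulls the posterior mean away from the sample mean and toward `theta0`, which is the weighted-combination behavior described above.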


Case of Known Mean and Unknown Variance 

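In this case, the precision is unknown and is treated as a random variable $\Xi$, with
prior distribution $f_{\Xi}(\xi)$. Given $\xi$, the $X_i$ are independent $N(\theta_0, \xi^{-1})$. Let
$X = (X_1, X_2, \ldots, X_n)$. Then

$$
\begin{aligned}
f_{\Xi|X}(\xi \mid x) &\propto f_{X|\Xi}(x \mid \xi)\, f_{\Xi}(\xi) \\
&\propto \xi^{n/2} \exp\left(-\frac{1}{2}\,\xi \sum (x_i - \theta_0)^2\right) f_{\Xi}(\xi)
\end{aligned}
$$

Observing how the density depends on $\xi$, we realize that it is analytically convenient
to specify the prior to be a gamma density: $\Xi \sim \Gamma(\alpha, \lambda)$. Then

$$f_{\Xi|X}(\xi \mid x) \propto \xi^{n/2} \exp\left(-\frac{1}{2}\,\xi \sum (x_i - \theta_0)^2\right) \xi^{\alpha - 1} e^{-\lambda \xi}$$

which is a gamma density with parameters

$$\alpha_{\text{post}} = \alpha + \frac{n}{2}$$

$$\lambda_{\text{post}} = \lambda + \frac{1}{2} \sum (x_i - \theta_0)^2$$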


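In the case of a flat prior (small $\alpha$ and $\lambda$), the posterior mean and mode of the
precision are

$$\text{Posterior mean} \approx \frac{n}{\sum (x_i - \theta_0)^2}$$

$$\text{Posterior mode} \approx \frac{n - 2}{\sum (x_i - \theta_0)^2}$$

The former is the maximum likelihood estimate of the precision, that is, the reciprocal of
the maximum likelihood estimate of $\sigma^2$, $\frac{1}{n}\sum (x_i - \theta_0)^2$. In the limit
$\lambda \to 0$, $\alpha \to 0$,

$$f_{\Xi|X}(\xi \mid x) \propto \xi^{n/2 - 1} \exp\left(-\frac{1}{2}\,\xi \sum (x_i - \theta_0)^2\right)$$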
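
As a small numerical companion to the gamma update for the precision, the sketch below computes the posterior shape and rate, together with the gamma mean (shape over rate) and mode ((shape minus 1) over rate). The function name and the data are hypothetical; the gamma density is assumed to be parametrized as proportional to $\xi^{\alpha-1}e^{-\lambda\xi}$, as in the text.

```python
def precision_posterior(xs, theta0, alpha, lam):
    """Gamma posterior for the precision of a normal with known mean theta0.

    Prior: Xi ~ Gamma(alpha, lam), density proportional to
    xi**(alpha - 1) * exp(-lam * xi). Returns the posterior shape and rate,
    then the posterior mean and mode.
    """
    n = len(xs)
    ss = sum((x - theta0) ** 2 for x in xs)
    alpha_post = alpha + n / 2
    lam_post = lam + ss / 2
    mean = alpha_post / lam_post
    mode = max(alpha_post - 1, 0.0) / lam_post  # mode sits at 0 when the shape is at most 1
    return alpha_post, lam_post, mean, mode


# Hypothetical data with known mean 0; alpha = lam = 0 is the flat-prior limit.
print(precision_posterior([1.0, -1.0, 2.0, -2.0], theta0=0.0, alpha=0.0, lam=0.0))
```

In the flat-prior limit the returned mean equals $n/\sum(x_i-\theta_0)^2$ and the mode equals $(n-2)/\sum(x_i-\theta_0)^2$.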


Case of Unknown Mean and Unknown Variance 
In this case, there are two unknown parameters, and a Bayesian approach requires
the specification of a joint two-dimensional prior distribution. We follow a path of
mathematical convenience and take the priors to be independent:

$$\Theta \sim N(\theta_0, \xi_{\text{prior}}^{-1})$$

$$\Xi \sim \Gamma(\alpha, \lambda)$$

We then have

$$
\begin{aligned}
f_{\Theta,\Xi|X}(\theta, \xi \mid x) &\propto f_{X|\Theta,\Xi}(x \mid \theta, \xi)\, f_{\Theta}(\theta)\, f_{\Xi}(\xi) \\
&\propto \xi^{n/2} \exp\left(-\frac{\xi}{2} \sum (x_i - \theta)^2\right)
\times \exp\left(-\frac{\xi_{\text{prior}}}{2} (\theta - \theta_0)^2\right) \xi^{\alpha - 1} \exp(-\lambda \xi)
\end{aligned}
$$

From the manner in which $\theta$ and $\xi$ occur in the first exponential, it appears that the
two variables are not independent in the posterior even though they were in the prior.
To evaluate this joint posterior density, we would have to find the constant of
proportionality that makes it integrate to 1 (the normalization constant). Two-dimensional
numerical integration could be used.

Often the primary interest is in the mean, $\theta$, and one useful aspect of Bayesian
analysis is that information about $\theta$ can be “marginalized” by integrating out $\xi$:

$$f_{\Theta|X}(\theta \mid x) = \int_0^{\infty} f_{\Theta,\Xi|X}(\theta, \xi \mid x)\, d\xi$$


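Examining the preceding expression for $f_{\Theta,\Xi|X}(\theta, \xi \mid x)$ as a function of $\xi$, we see
that it is of the form of a gamma density, with parameters $\tilde{\alpha} = \alpha + n/2$ and
$\tilde{\lambda} = \lambda + (1/2)\sum (x_i - \theta)^2$, so we can evaluate the integral. We thus find

$$f_{\Theta|X}(\theta \mid x) \propto \exp\left(-\frac{\xi_{\text{prior}}}{2} (\theta - \theta_0)^2\right)
\frac{\Gamma(\alpha + n/2)}{\left[\lambda + \frac{1}{2}\sum (x_i - \theta)^2\right]^{\alpha + n/2}}$$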
This is not a density that we recognize, but it could be evaluated numerically. Doing
so would again entail finding the normalizing constant, which could be done by
numerical integration. Some simplifications occur when $n$ is large or when the prior
is quite flat ($\alpha$, $\lambda$, $\xi_{\text{prior}}$ are small). Then

$$f_{\Theta|X}(\theta \mid x) \propto \left(\sum (x_i - \theta)^2\right)^{-n/2}$$


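This posterior is maximized when $\sum (x_i - \theta)^2$ is minimized, which occurs at
$\theta = \bar{x}$. We can relate this to the result we found for maximum likelihood analysis by
expressing

$$
\begin{aligned}
\sum (x_i - \theta)^2 &= \sum (x_i - \bar{x})^2 + n(\theta - \bar{x})^2 \\
&= (n - 1)s^2 + n(\theta - \bar{x})^2 \\
&= (n - 1)s^2 \left(1 + \frac{1}{n - 1}\,\frac{n(\theta - \bar{x})^2}{s^2}\right)
\end{aligned}
$$

Substituting this above and absorbing terms that do not depend on $\theta$ into the
proportionality constant, we find

$$f_{\Theta|X}(\theta \mid x) \propto \left(1 + \frac{1}{n - 1}\,\frac{n(\theta - \bar{x})^2}{s^2}\right)^{-n/2}$$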


Now comparing this to the definition of the $t$ distribution (Section 6.2), we see that

$$\frac{\sqrt{n}\,(\theta - \bar{x})}{s} \sim t_{n-1}$$

corresponding to the result from maximum likelihood analysis.

The interval $\bar{x} \pm t_{n-1}(\alpha/2)\, s/\sqrt{n}$ was earlier derived as a $100(1 - \alpha)\%$ confidence
interval centered about the maximum likelihood estimate, and here it has reappeared
in the Bayesian analysis as an interval with posterior probability $1 - \alpha$. There are
differences of interpretation, however, just as there were for the earlier Poisson case.
The Bayesian interval is a probability statement referring to the state of knowledge
about $\theta$ given the observed data, regarding $\theta$ as a random variable. The frequentist
confidence interval is based on a probability statement about the possible values of
the observations, regarding $\theta$ as a constant, albeit unknown.
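
The coincidence of the two intervals can be seen numerically. The sketch below is an illustration under stated assumptions: it takes the flat-prior marginal posterior kernel $\left(1 + \frac{1}{n-1}\frac{n(\theta-\bar{x})^2}{s^2}\right)^{-n/2}$, normalizes it by the trapezoid rule on a wide grid, and reads off the central 95% interval. The function name, grid settings, and data are hypothetical choices made here.

```python
import math


def posterior_interval(xs, level=0.95, half_range=200.0, steps=400_000):
    """Central posterior interval for theta under the flat-prior marginal posterior.

    The kernel is integrated numerically (trapezoid rule) on a grid extending
    half_range standard errors on either side of the sample mean.
    """
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    se = math.sqrt(s2 / n)
    lo = xbar - half_range * se
    h = 2 * half_range * se / steps
    grid = [lo + i * h for i in range(steps + 1)]
    dens = [(1 + n * (t - xbar) ** 2 / ((n - 1) * s2)) ** (-n / 2) for t in grid]
    cum = [0.0]
    for i in range(steps):
        cum.append(cum[-1] + 0.5 * h * (dens[i] + dens[i + 1]))
    total = cum[-1]  # numerical normalizing constant

    def quantile(p):
        target = p * total
        return next(t for t, c in zip(grid, cum) if c >= target)

    tail = (1 - level) / 2
    return quantile(tail), quantile(1 - tail), xbar, se


data = [4.9, 5.3, 5.1, 4.7, 5.0]  # hypothetical observations, n = 5
lower, upper, xbar, se = posterior_interval(data)
# Half-width in standard-error units; compare with the tabled value t_4(0.025) = 2.776.
print((upper - xbar) / se)
```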


EXAMPLE C

Hardy-Weinberg Equilibrium

We now turn to a Bayesian treatment of Example A in Section 8.5.1. We use the
multinomial likelihood function and a prior for $\theta$, which is uniform on [0, 1]. The
posterior density is thus proportional to the likelihood, and is shown in Figure 8.10.
Note that it looks very much like a normal density, a phenomenon that will be
explored in a later section. Since $f_{X|\Theta}(x \mid \theta)$ is a polynomial in $\theta$ (of high degree),
the normalization constant can in principle be computed analytically. (Alternatively,
all the computations can be done numerically.)

FIGURE 8.10 Posterior distribution of $\Theta$.

Because the prior is flat, the posterior is directly proportional to the likelihood
and the maximum of the posterior density is the maximum likelihood estimate,
$\hat{\theta} = 0.4247$. The 0.025 quantile of the density is 0.404, and the 0.975 quantile is 0.446.
These results agree with the approximate confidence interval found for the maximum
likelihood estimate in Example C in Section 8.5.3. ■


Further Remarks on Priors 


In the previous section, we saw that if the prior for a Poisson parameter is chosen 
to be a gamma density, then the posterior is also a gamma density. Similarly, when 
the prior for a normal mean with known variance is chosen to be normal, then the 
posterior is normal as well. Earlier, in Example E in Section 3.5.2, a beta prior was 
used for a binomial parameter, and the posterior turned out to be beta as well. These 
are examples of conjugate priors: if the prior distribution belongs to a family G 
and, conditional on the parameters of G, the data have a distribution H, then G is 
said to be conjugate to H if the posterior is in the family G. Other conjugate priors 
will be the subject of problems at the end of the chapter. Conjugate priors are used 
for mathematical convenience (required integrations can be done in closed form) 
and because they can assume a variety of shapes as the parameters of the prior are 
varied. 

In scientific applications, it is usually desirable to use a flat, or “uninformative,” 
prior so that the data can speak for themselves. Even if a scientific investigator actually 
had a strong prior opinion, he or she might want to present an “objective” analysis. 
This is accomplished by using a flat prior so that the conclusions, as summarized in 
the posterior density, are those of one who is initially unopinionated or unprejudiced. 
If an informative prior were used, it would have to be justified to the larger scientific
community. The objective prior thus has a hypothetical, or “what if,” status: if one
was initially indifferent to parameter values in the range in which the likelihood is 
large, then one's opinion after observing the data would be expressed as a posterior 
proportional to the likelihood. 

Attempts have been made to formalize more precisely what the notion of an unin- 
formative prior means. One problem that is addressed is caused by reparametrization. 
For example, suppose that the prior density of the precision $\xi$ is taken to be uniform on
an interval $[a, b]$, which might seem to be a reasonable way to quantify the notion of
being uninformative. However, if the variance $\sigma^2 = 1/\xi$, rather than the precision, was
used, the prior density of $\sigma^2$ would not be uniform on $[b^{-1}, a^{-1}]$. We will not delve
further into these issues here, except to note that the parametrization $\theta$ or $g(\theta)$ would
make a difference only if the difference in the shapes of the priors was substantial in
the region in which the likelihood was large.

We saw in the Poisson example that if $\alpha$ and $\nu$ are very small, the gamma prior
is quite flat and the posterior is proportional to the likelihood function. Formally, if $\alpha$
and $\nu$ are set equal to zero, then the prior is

$$f_{\Lambda}(\lambda) = \lambda^{-1}, \qquad 0 \le \lambda < \infty$$

But this function does not integrate to 1; it is not a probability density. A similar
phenomenon occurs in the normal case with unknown mean and known precision, if
the prior precision is set equal to 0. The prior is then

$$f_{\Theta}(\theta) \propto 1, \qquad -\infty < \theta < \infty$$

and is not a probability density either. Such priors are called improper priors (priors
that lack propriety).

In general, if an improper prior is formally used, the posterior may not be a
density either, because the denominator of the expression for the posterior density,
$\int f_{X|\Theta}(x \mid \theta)\, f_{\Theta}(\theta)\, d\theta$, may not converge. (Note that it is integrated with respect to
$\theta$, not $x$.) This has not been the case in our examples. For the Poisson example, if
$f_{\Lambda}(\lambda) \propto \lambda^{-1}$, then the denominator is

$$\int_0^{\infty} \lambda^{\sum x_i - 1} e^{-n\lambda}\, d\lambda < \infty$$

provided that $\sum x_i > 0$. In the normal case, too, the integral is defined, and thus there
is a well-defined posterior density.

Let us revisit some examples using the device of an improper prior. In the Poisson
example, using the improper prior $f_{\Lambda}(\lambda) = \lambda^{-1}$ results in a (proper) posterior

$$f_{\Lambda|X}(\lambda \mid x) \propto \lambda^{\sum x_i - 1} e^{-n\lambda}$$

which can be recognized as a gamma density.

In the normal example with unknown mean and variance, we can take $\theta$ and $\xi$ to
be independent with improper priors $f_{\Theta}(\theta) = 1$ and $f_{\Xi}(\xi) = \xi^{-1}$. The joint posterior
of $\theta$ and $\xi$ is then

$$f_{\Theta,\Xi|X}(\theta, \xi \mid x) \propto \xi^{n/2 - 1} \exp\left(-\frac{\xi}{2} \sum (x_i - \theta)^2\right)$$

Expressing $\sum (x_i - \theta)^2 = (n - 1)s^2 + n(\theta - \bar{x})^2$, we have

$$f_{\Theta,\Xi|X}(\theta, \xi \mid x) \propto \xi^{n/2 - 1} \exp\left(-\frac{\xi}{2}(n - 1)s^2\right) \exp\left(-\frac{n\xi}{2}(\theta - \bar{x})^2\right)$$

For fixed $\xi$, this expression is proportional to the conditional density of $\theta$ given $\xi$.
(Why?) From the form of the dependence on $\theta$, we see that conditional on $\xi$, $\theta$ is
normal with mean $\bar{x}$ and precision $n\xi$. By integrating out $\xi$, we can find the marginal
distribution of $\theta$ and relate it to the $t$ distribution as was done earlier.

Since improper priors are not actually probability densities, they are difficult to 
interpret literally. However, the resulting posteriors can be viewed as approximations 
to those that would have arisen with extreme values of the parameters of proper 
priors. The priors corresponding to such extreme values are very flat, so the posterior 
is dominated by the likelihood. Then it is only in the range in which the likelihood is 
large that the prior makes any practical difference—truncating the improper prior well 
outside this range to produce a proper prior will not appreciably change the posterior. 


Large Sample Normal Approximation to the Posterior 


We have seen in several examples that the posterior distribution is nearly normal with 
the mean equal to the maximum likelihood estimate, and that the posterior standard 
deviation is close to the asymptotic standard deviation of the maximum likelihood 
estimate. The two methods thus often give quite comparable results. We will not give 
a formal proof here, but rather will sketch an argument that the posterior distribution 
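is approximately normal with the mean equal to the maximum likelihood estimate,
$\hat{\theta}$, and variance approximately equal to $-[l''(\hat{\theta})]^{-1}$.

Denoting the observations generically by $x$, the posterior distribution is

$$
\begin{aligned}
f_{\Theta|X}(\theta \mid x) &\propto f_{\Theta}(\theta)\, f_{X|\Theta}(x \mid \theta) \\
&= \exp[\log f_{\Theta}(\theta)]\, \exp[\log f_{X|\Theta}(x \mid \theta)] \\
&= \exp[\log f_{\Theta}(\theta)]\, \exp[l(\theta)]
\end{aligned}
$$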
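Now, if the sample is large, the posterior is dominated by the likelihood, and in the
region where the likelihood is large, the prior is nearly constant. Thus, to an
approximation,

$$
\begin{aligned}
f_{\Theta|X}(\theta \mid x) &\propto \exp\left[l(\hat{\theta}) + (\theta - \hat{\theta})\, l'(\hat{\theta}) + \frac{1}{2}(\theta - \hat{\theta})^2\, l''(\hat{\theta})\right] \\
&\propto \exp\left[\frac{1}{2}(\theta - \hat{\theta})^2\, l''(\hat{\theta})\right]
\end{aligned}
$$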


In the last step, we used the fact that since $\hat{\theta}$ is the maximum likelihood estimate,
$l'(\hat{\theta}) = 0$. The term $l(\hat{\theta})$ was absorbed into a proportionality constant, since we are
evaluating the posterior as a function of $\theta$. Finally, observe that the last expression is
proportional to a normal density with mean $\hat{\theta}$ and variance $-[l''(\hat{\theta})]^{-1}$.
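
A minimal sketch of the approximation, under assumptions stated here: the helper `normal_approximation` (a hypothetical name) takes a log likelihood and its maximizer and approximates $l''(\hat{\theta})$ by a central difference. It is applied to hypothetical Poisson counts, for which $l(\lambda) = S\log\lambda - n\lambda$ up to a constant, with $S = \sum x_i$ and $\hat{\lambda} = S/n$.

```python
import math


def normal_approximation(loglik, theta_hat, h=1e-4):
    """Mean and variance of the large-sample normal approximation to the posterior.

    loglik is the log likelihood l(theta) and theta_hat its maximizer; the
    second derivative at theta_hat is approximated by a central difference.
    """
    second = (loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h**2
    return theta_hat, -1.0 / second


counts = [3, 5, 4, 6, 2]  # hypothetical Poisson observations
n, s = len(counts), sum(counts)


def poisson_loglik(lam):
    # Terms that do not involve lam are dropped.
    return s * math.log(lam) - n * lam


mean, var = normal_approximation(poisson_loglik, s / n)
print(mean, var)
# For comparison: under the improper prior 1/lambda the exact posterior is
# gamma with shape s and rate n, whose mean and variance are
print(s / n, s / n**2)
```

In this particular case the approximating normal matches the mean and variance of the exact gamma posterior, although the shapes of the two densities still differ for small samples.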


8.6.3 Computational Aspects


Contemporary computational resources have had an enormous impact on Bayesian 
inference. As we have seen in several examples, the computationally difficult part 
of Bayesian inference is the calculation of the normalizing constant that makes the 
posterior density integrate to 1. Traditionally, such calculations were performed ana- 
lytically, often using conjugate priors so that the integrations could be done explicitly. 
The numerical integration of a well-behaved function of a small number of variables 
is now trivial. 

Difficulties do arise in high dimensional problems, however, and the integrations 
are often done by sophisticated Monte Carlo methods. We will not go into these 
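sorts of methods in this book, but will hint at their nature in the following example
of a method called Gibbs Sampling. Consider, as a simple example, inference
for a normal distribution with unknown mean and variance. From Example B in
Section 8.6,

$$
\begin{aligned}
f_{\Theta,\Xi|X}(\theta, \xi \mid x) &\propto \xi^{n/2} \exp\left(-\frac{\xi}{2} \sum (x_i - \theta)^2\right) \\
&\quad \times \exp\left(-\frac{\xi_{\text{prior}}}{2} (\theta - \theta_0)^2\right) \xi^{\alpha - 1} \exp(-\lambda \xi)
\end{aligned}
$$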





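For simplicity, suppose that an improper prior is used: $\xi_{\text{prior}} \to 0$, $\alpha \to 0$, $\lambda \to 0$.
Then

$$
\begin{aligned}
f_{\Theta,\Xi|X}(\theta, \xi \mid x) &\propto \xi^{n/2 - 1} \exp\left(-\frac{\xi}{2} \sum (x_i - \theta)^2\right) \\
&= \xi^{n/2 - 1} \exp\left(-\frac{\xi}{2} \sum (x_i - \bar{x})^2\right) \exp\left(-\frac{n\xi}{2} (\theta - \bar{x})^2\right)
\end{aligned}
$$

Here we expressed

$$\sum (x_i - \theta)^2 = \sum (x_i - \bar{x})^2 + n(\theta - \bar{x})^2$$

The first exponential factor does not involve $\theta$, so for fixed $\xi$ it can be absorbed into
the constant of proportionality. To study the posterior distribution of $\Xi$ and $\Theta$ by
Monte Carlo, we would draw many pairs $(\xi_k, \theta_k)$ from this joint density; the problem is
how to actually do this.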

Gibbs Sampling would accomplish this in the following way. Observe that the
expression for $f_{\Theta,\Xi|X}(\theta, \xi \mid x)$ shows that for given $\xi$, $\theta$ is normally distributed with mean
$\bar{x}$ and precision $n\xi$. (Fix $\xi$ in the expression and recognize a normal density in $\theta$.)
Also, if $\theta$ is fixed, the density of $\xi$ is a gamma density (with parameters $n/2$ and
$\frac{1}{2}\sum (x_i - \theta)^2$). Gibbs Sampling alternates back and forth between the two
conditional distributions:


1. Choose an initial value $\theta_0$; $\bar{x}$ would be a natural choice.

2. Generate $\xi_0$ from a gamma density with parameter $\theta_0$.

3. Generate $\theta_1$ from a normal distribution with parameter $\xi_0$.

4. Generate $\xi_1$ from a gamma density with parameter $\theta_1$.

5. Continue on in this fashion.


The analysis of the algorithm and why it works is beyond the scope of this book. A
"burn-in" period is required so that we might run this scheme for a few hundred steps
before beginning to record pairs $(\xi_k, \theta_k)$, $k = 1, \ldots, N$, which would be regarded
as simulated pairs from the posterior. A further complication is that these pairs are
not independent of one another. But, nonetheless, a histogram of the collection of $\theta_k$
could be used as an estimate of the marginal posterior distribution of $\Theta$. The posterior
mean of $\Theta$ can be estimated as

$$E(\Theta \mid X) \approx \frac{1}{N} \sum_{k=1}^{N} \theta_k$$
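
The alternating scheme can be sketched in a few lines of Python using only the standard library. This is an illustration under the improper-prior setup above, not a definitive implementation: the function name `gibbs_normal`, the burn-in length, and the data are assumptions made here. Note that `random.gammavariate` takes a shape and a scale, so the gamma rate $\frac{1}{2}\sum(x_i-\theta)^2$ enters as its reciprocal.

```python
import math
import random


def gibbs_normal(xs, n_draws=5000, burn_in=500, seed=0):
    """Gibbs sampler for (xi, theta) with the improper prior used above.

    theta | xi  ~ Normal(mean xbar, precision n * xi)
    xi | theta  ~ Gamma(shape n / 2, rate sum((x - theta)**2) / 2)
    """
    rng = random.Random(seed)
    n = len(xs)
    xbar = sum(xs) / n
    theta = xbar  # natural starting value
    draws = []
    for step in range(burn_in + n_draws):
        rate = 0.5 * sum((x - theta) ** 2 for x in xs)
        xi = rng.gammavariate(n / 2, 1.0 / rate)  # second argument is a scale
        theta = rng.gauss(xbar, 1.0 / math.sqrt(n * xi))
        if step >= burn_in:  # discard the burn-in period
            draws.append((xi, theta))
    return draws


data = [4.9, 5.3, 5.1, 4.7, 5.0]  # hypothetical observations
draws = gibbs_normal(data)
print(sum(theta for _, theta in draws) / len(draws))  # estimate of E(Theta | X)
```

A histogram of the recorded `theta` values would serve as the estimate of the marginal posterior described in the text.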


Efficiency and the Cramér-Rao 
Lower Bound 


In most statistical estimation problems, there are a variety of possible parameter 
estimates. For example, in Chapter 7 we considered both the sample mean and a 
ratio estimate, and in this chapter we considered the method of moments and the 
method of maximum likelihood. Given a variety of possible estimates, how would we 
choose which to use? Qualitatively, it would be sensible to choose that estimate whose 
sampling distribution was most highly concentrated about the true parameter value. 
To define this aim operationally, we would need to specify a quantitative measure 
of such concentration. Mean squared error is the most commonly used measure of 
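concentration, largely because of its analytic simplicity. The mean squared error of $\hat{\theta}$
as an estimate of $\theta_0$ is

$$
\begin{aligned}
\mathrm{MSE}(\hat{\theta}) &= E(\hat{\theta} - \theta_0)^2 \\
&= \mathrm{Var}(\hat{\theta}) + (E(\hat{\theta}) - \theta_0)^2
\end{aligned}
$$

(See Theorem A of Section 4.2.1.) If the estimate $\hat{\theta}$ is unbiased [$E(\hat{\theta}) = \theta_0$],
$\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta})$. When the estimates under consideration are unbiased, comparison
of their mean squared errors reduces to comparison of their variances, or equivalently,
standard errors.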

Given two estimates, $\hat{\theta}$ and $\tilde{\theta}$, of a parameter $\theta$, the efficiency of $\hat{\theta}$ relative to $\tilde{\theta}$
is defined to be

$$\mathrm{eff}(\hat{\theta}, \tilde{\theta}) = \frac{\mathrm{Var}(\tilde{\theta})}{\mathrm{Var}(\hat{\theta})}$$

Thus, if the efficiency is smaller than 1, $\hat{\theta}$ has a larger variance than $\tilde{\theta}$ has. This
comparison is most meaningful when both $\hat{\theta}$ and $\tilde{\theta}$ are unbiased or when both have
the same bias. Frequently, the variances of $\hat{\theta}$ and $\tilde{\theta}$ are of the form

$$\mathrm{Var}(\hat{\theta}) = \frac{c_1}{n}$$

$$\mathrm{Var}(\tilde{\theta}) = \frac{c_2}{n}$$

where $n$ is the sample size. If this is the case, the efficiency can be interpreted as
the ratio of sample sizes necessary to obtain the same variance for both $\hat{\theta}$ and $\tilde{\theta}$. (In
Chapter 7, we compared the efficiencies of estimates of a population mean from a
simple random sample, a stratified random sample with proportional allocation, and
a stratified random sample with optimal allocation.)


EXAMPLE A

Muon Decay

Two estimates have been derived for $\alpha$ in the problem of muon decay. The method of
moments estimate is

$$\tilde{\alpha} = 3\bar{X}$$

The maximum likelihood estimate is the solution of the nonlinear equation

$$\sum_{i=1}^{n} \frac{X_i}{1 + \hat{\alpha} X_i} = 0$$

We need to find the variances of these two estimates.
Since the variance of a sample mean is $\sigma^2/n$, we compute $\sigma^2$:

$$
\begin{aligned}
\sigma^2 &= E(X^2) - [E(X)]^2 \\
&= \int_{-1}^{1} x^2\, \frac{1 + \alpha x}{2}\, dx - \frac{\alpha^2}{9} \\
&= \frac{1}{3} - \frac{\alpha^2}{9}
\end{aligned}
$$

Thus, the variance of the method of moments estimate is

$$\mathrm{Var}(\tilde{\alpha}) = 9\,\mathrm{Var}(\bar{X}) = \frac{3 - \alpha^2}{n}$$
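The exact variance of the mle, $\hat{\alpha}$, cannot be computed in closed form, so we
approximate it by the asymptotic variance,

$$\mathrm{Var}(\hat{\alpha}) \approx \frac{1}{nI(\alpha)}$$

and then compare this asymptotic variance to the variance of $\tilde{\alpha}$. The ratio of the former
to the latter is called the asymptotic relative efficiency. By definition,

$$
\begin{aligned}
I(\alpha) &= E\left[\frac{\partial}{\partial \alpha} \log f(X \mid \alpha)\right]^2 \\
&= \int_{-1}^{1} \frac{x^2}{(1 + \alpha x)^2}\, \frac{1 + \alpha x}{2}\, dx \\
&= \begin{cases}
\dfrac{\log\left(\dfrac{1 + \alpha}{1 - \alpha}\right) - 2\alpha}{2\alpha^3}, & -1 < \alpha < 1,\ \alpha \ne 0 \\[2ex]
\dfrac{1}{3}, & \alpha = 0
\end{cases}
\end{aligned}
$$

The asymptotic relative efficiency is thus (for $\alpha \ne 0$)

$$\frac{\mathrm{Var}(\hat{\alpha})}{\mathrm{Var}(\tilde{\alpha})} = \frac{2\alpha^3}{3 - \alpha^2}\, \frac{1}{\log\left(\dfrac{1 + \alpha}{1 - \alpha}\right) - 2\alpha}$$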



The following table gives this efficiency for various values of $\alpha$ between 0 and 1;
symmetry would yield the values between $-1$ and 0.

  α       Efficiency

  0.0     1.0
  .1      .997
  .2      .989
  .3      .975
  .4      .953
  .5      .922
  .6      .878
  .7      .817
  .8      .727
  .9      .582
  .95     .464





As $\alpha$ tends to 1, the efficiency tends to 0. Thus, the mle is not much better than the
method of moments estimate for $\alpha$ close to 0 but does increasingly better as $\alpha$ tends
to 1.

It must be kept in mind that we used the asymptotic variance of the mle, so we
calculated an asymptotic relative efficiency, viewing this as an approximation to the
actual relative efficiency. To gain more precise information for a given sample size,
a simulation of the sampling distribution of the mle could be conducted. This might
be especially interesting for $\alpha = 1$, a case for which the formula for the asymptotic
variance given above does not appear to make much sense. With a simulation study,
the behavior of the bias as $n$ and $\alpha$ vary could be analyzed (we showed that the mle
is asymptotically unbiased, but there may be bias for a finite sample size), and the
actual distribution could be compared to the approximating normal. ■
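
The asymptotic relative efficiency in this example is easy to tabulate directly from the closed forms for $I(\alpha)$ and $\mathrm{Var}(\tilde{\alpha})$. The following sketch does so; the function names are hypothetical, and the sample size cancels in the ratio, so it does not appear.

```python
import math


def fisher_information(alpha):
    """I(alpha) for the density f(x | alpha) = (1 + alpha * x) / 2 on [-1, 1]."""
    if alpha == 0:
        return 1 / 3
    return (math.log((1 + alpha) / (1 - alpha)) - 2 * alpha) / (2 * alpha**3)


def asymptotic_relative_efficiency(alpha):
    """Ratio of 1 / (n I(alpha)) to (3 - alpha**2) / n; n cancels."""
    return 1 / (fisher_information(alpha) * (3 - alpha**2))


for a in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    print(f"{a:4.2f}  {asymptotic_relative_efficiency(a):.3f}")
```

The formula is symmetric in $\alpha$, so evaluating it at negative values reproduces the same efficiencies.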


In searching for an optimal estimate, we might ask whether there is a lower bound 
for the MSE of any estimate. If such a lower bound existed, it would function as a 
benchmark against which estimates could be compared. If an estimate achieved this 
lower bound, we would know that it could not be improved upon. In the case in which 
the estimate is unbiased, the Cramér-Rao inequality provides such a lower bound. We 
now state and prove the Cramér-Rao inequality. 


THEOREM A Cramér-Rao Inequality 


Let $X_1, \ldots, X_n$ be i.i.d. with density function $f(x \mid \theta)$. Let $T = t(X_1, \ldots, X_n)$
be an unbiased estimate of $\theta$. Then, under smoothness assumptions on $f(x \mid \theta)$,

$$\mathrm{Var}(T) \ge \frac{1}{nI(\theta)}$$
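Proof

Let

$$Z = \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(X_i \mid \theta)
= \sum_{i=1}^{n} \frac{\dfrac{\partial}{\partial \theta} f(X_i \mid \theta)}{f(X_i \mid \theta)}$$

In Section 8.5.2, we showed that $E(Z) = 0$. Because the correlation coefficient
of $Z$ and $T$ is less than or equal to 1 in absolute value,

$$\mathrm{Cov}^2(Z, T) \le \mathrm{Var}(Z)\,\mathrm{Var}(T)$$

It was also shown in Section 8.5.2 that

$$\mathrm{Var}\left[\frac{\partial}{\partial \theta} \log f(X \mid \theta)\right] = I(\theta)$$

Therefore,

$$\mathrm{Var}(Z) = nI(\theta)$$

The proof will be completed by showing that $\mathrm{Cov}(Z, T) = 1$. Since $Z$ has
mean 0,

$$
\begin{aligned}
\mathrm{Cov}(Z, T) &= E(ZT) \\
&= \int \cdots \int t(x_1, \ldots, x_n) \left[\sum_{i=1}^{n} \frac{\dfrac{\partial}{\partial \theta} f(x_i \mid \theta)}{f(x_i \mid \theta)}\right] \prod_{j=1}^{n} f(x_j \mid \theta)\, dx_j
\end{aligned}
$$

Noting that

$$\sum_{i=1}^{n} \frac{\dfrac{\partial}{\partial \theta} f(x_i \mid \theta)}{f(x_i \mid \theta)} \prod_{j=1}^{n} f(x_j \mid \theta)
= \frac{\partial}{\partial \theta} \prod_{i=1}^{n} f(x_i \mid \theta)$$

we rewrite the expression for the covariance of $Z$ and $T$ as

$$
\begin{aligned}
\mathrm{Cov}(Z, T) &= \int \cdots \int t(x_1, \ldots, x_n)\, \frac{\partial}{\partial \theta} \prod_{i=1}^{n} f(x_i \mid \theta)\, dx_i \\
&= \frac{\partial}{\partial \theta} \int \cdots \int t(x_1, \ldots, x_n) \prod_{i=1}^{n} f(x_i \mid \theta)\, dx_i \\
&= \frac{\partial}{\partial \theta} E(T) = \frac{\partial}{\partial \theta}(\theta) = 1
\end{aligned}
$$

which proves the inequality. [Note the interchange of differentiation and integration
that must be justified by the smoothness assumptions on $f(x \mid \theta)$.] ■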
Theorem A gives a lower bound on the variance of any unbiased estimate. An
unbiased estimate whose variance achieves this lower bound is said to be efficient. 
Since the asymptotic variance of a maximum likelihood estimate is equal to the 
lower bound, maximum likelihood estimates are said to be asymptotically efficient. 
For a finite sample size, however, a maximum likelihood estimate may not be ef- 
ficient, and maximum likelihood estimates are not the only asymptotically efficient 
estimates. 


Poisson Distribution 
In Example B in Section 8.5.3, we found that for the Poisson distribution

$$I(\lambda) = \frac{1}{\lambda}$$

Therefore, by Theorem A, for any unbiased estimate $T$ of $\lambda$, based on a sample of
independent Poisson random variables, $X_1, \ldots, X_n$,

$$\mathrm{Var}(T) \ge \frac{\lambda}{n}$$

The mle of $\lambda$ was found to be $\bar{X} = S/n$, where $S = X_1 + \cdots + X_n$. Since $S$ follows a
Poisson distribution with parameter $n\lambda$, $\mathrm{Var}(S) = n\lambda$ and $\mathrm{Var}(\bar{X}) = \lambda/n$. Therefore,
$\bar{X}$ attains the Cramér-Rao lower bound, and we know that no unbiased estimator of $\lambda$
can have a smaller variance. In this sense, $\bar{X}$ is optimal for the Poisson distribution.
But note that the theorem does not preclude the possibility that there is a biased
estimator of $\lambda$ that has a smaller mean squared error than $\bar{X}$ does. ■
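
That $\bar{X}$ attains the bound $\lambda/n$ can also be seen in a small simulation. The sketch below is an illustration only: the Poisson generator uses Knuth's multiplication method (a technique chosen here because the Python standard library has no Poisson sampler), and the function names, $\lambda = 3$, and $n = 10$ are assumptions.

```python
import math
import random


def poisson_draw(lam, rng):
    """One Poisson(lam) variate by Knuth's multiplication method (suitable for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def variance_of_sample_mean(lam, n, reps=20000, seed=0):
    """Simulated variance of the mean of n independent Poisson(lam) observations."""
    rng = random.Random(seed)
    means = [sum(poisson_draw(lam, rng) for _ in range(n)) / n for _ in range(reps)]
    grand = sum(means) / reps
    return sum((m - grand) ** 2 for m in means) / (reps - 1)


lam, n = 3.0, 10
print(variance_of_sample_mean(lam, n), lam / n)  # simulated variance vs. the lower bound
```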


An Example: The Negative Binomial Distribution 


The Poisson distribution is often the first model considered for random counts; it 
has the property that the mean of the distribution is equal to the variance. When it is 
found that the variance of the counts is substantially larger than the mean, the negative 
binomial distribution is sometimes instead considered as a model. We consider a 
reparametrization and generalization of the negative binomial distribution introduced 
in Section 2.1.3, which is a discrete distribution on the nonnegative integers with a 
frequency function depending on the parameters m and k: 


$$f(x \mid m, k) = \left(1 + \frac{m}{k}\right)^{-k} \frac{\Gamma(k + x)}{x!\,\Gamma(k)} \left(\frac{m}{m + k}\right)^{x}$$

The mean and variance of the negative binomial distribution can be shown
to be

$$\mu = m$$

$$\sigma^2 = m + \frac{m^2}{k}$$

It is apparent that this distribution is overdispersed ($\sigma^2 > \mu$) relative to the Poisson.
We will not derive the mean and variance. (They are most easily obtained by using
moment-generating functions.)
The negative binomial distribution can be used as a model in several cases:

* If $k$ is an integer, the distribution of the number of successes up to the $k$th failure in a
  sequence of independent Bernoulli trials with probability of success $p = m/(m + k)$
  is negative binomial.

* Suppose that $\Lambda$ is a random variable following a gamma distribution and that for $\lambda$,
  a given value of $\Lambda$, $X$ follows a Poisson distribution with mean $\lambda$. It can be shown
  that the unconditional distribution of $X$ is negative binomial. Thus, for situations in
  which the rate varies randomly over time or space, the negative binomial distribution
  might tentatively be considered as a model.

* The negative binomial distribution also arises with a particular type of clustering.
  Suppose that counts of colonies, or clusters, follow a Poisson distribution and that
  each colony has a random number of individuals. If the probability distribution
  of the number of individuals per colony is of a particular form (the logarithmic
  series distribution), it can be shown that the distribution of counts of individuals is
  negative binomial. The negative binomial distribution might be a plausible model
  for the distribution of insect counts if the insects hatch from depositions, or clumps,
  of larvae.

* The negative binomial distribution can be applied to model population size in a
  certain birth/death process, the assumption being that the birth rate and death rate
  per individual are constant and that there is a constant rate of immigration.


Anscombe (1950) discusses estimation of the parameters $m$ and $k$ and compares
the efficiencies of several methods of estimation. The simplest method is the method
of moments; from the relations of $m$ and $k$ to $\mu$ and $\sigma^2$ given previously, the method
of moments estimates of $m$ and $k$ are

$$\hat{m} = \bar{X}$$

$$\hat{k} = \frac{\bar{X}^2}{\hat{\sigma}^2 - \bar{X}}$$
Another relatively simple method of estimation of $m$ and $k$ is based on the number
of zeros. The probability of the count being zero is

$$p_0 = \left(1 + \frac{m}{k}\right)^{-k}$$

If $m$ is estimated by the sample mean and there are $n_0$ zeros out of a sample size of
$n$, then $k$ is estimated by $\hat{k}$, where $\hat{k}$ satisfies

$$\frac{n_0}{n} = \left(1 + \frac{\bar{X}}{\hat{k}}\right)^{-\hat{k}}$$


Although the solution cannot be obtained in closed form, it is not difficult to find by 
iteration. 
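
Both simple estimates can be sketched in Python. This is a hedged illustration, not Anscombe's procedure: the function names and the count data are hypothetical, $\hat{\sigma}^2$ is taken with divisor $n$, and the equation $n_0/n = (1 + \bar{X}/\hat{k})^{-\hat{k}}$ is solved by bisection on its logarithm, which is one of several possible iterations. A solution exists only when $e^{-\bar{X}} < n_0/n < 1$.

```python
import math


def moment_estimates(xs):
    """Method of moments estimates (m, k); the variance uses divisor n."""
    n = len(xs)
    xbar = sum(xs) / n
    var = sum((x - xbar) ** 2 for x in xs) / n
    if var <= xbar:
        raise ValueError("sample is not overdispersed; no moment estimate of k")
    return xbar, xbar**2 / (var - xbar)


def zeros_estimate(xs, tol=1e-12):
    """Estimates (m, k) from the sample mean and the proportion of zeros."""
    n = len(xs)
    xbar = sum(xs) / n
    n0 = sum(1 for x in xs if x == 0)
    if n0 == 0 or n0 == n:
        raise ValueError("need at least one zero and one nonzero count")
    target = -math.log(n0 / n)
    if target >= xbar:
        raise ValueError("no solution: need exp(-xbar) < n0 / n")

    def g(k):
        # k * log(1 + xbar / k) increases from 0 to xbar as k grows.
        return k * math.log1p(xbar / k) - target

    lo, hi = 1e-8, 1.0
    while g(hi) < 0:
        hi *= 2
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return xbar, 0.5 * (lo + hi)


counts = [0, 0, 0, 1, 0, 2, 5, 0, 1, 3, 0, 0, 7, 1, 0, 2, 0, 4, 0, 1]  # hypothetical
print(moment_estimates(counts))
print(zeros_estimate(counts))
```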

Figure 8.11, from Anscombe (1950), shows the asymptotic efficiencies of the two 
methods of estimation of the negative binomial parameters relative to the maximum 
likelihood estimate. In the figure, the method of moments is method 1 and the method 
based on the number of zeros is method 2. Method 2 is quite efficient when the mean 
is small—that is, when there are a large number of zeros. Method 1 becomes more 
efficient as k increases. 


FIGURE 8.11 Asymptotic efficiencies of estimates of negative binomial parameters.


The maximum likelihood estimate is asymptotically efficient but is somewhat
more difficult to compute. The equations will not be written out here. Bliss and Fisher
(1953) discuss computational methods and give several examples. The maximum
likelihood estimate of $m$ is the sample mean, but that of $k$ is the solution of a nonlinear
equation.


Insect Counts 


Let us consider an example from Bliss and Fisher (1953). From each of 6 apple trees 
in an orchard that was sprayed, 25 leaves were selected. On each of the leaves, the 
number of adult female red mites was counted. Intuitively, we might conclude that 
this situation was too heterogeneous for a Poisson model to fit; the rates of infestation 
might be different on different trees and at different locations on the same tree. 
The following table shows the observed counts and the expected counts from fitting
Poisson and negative binomial distributions. The mle’s for $k$ and $m$ were $\hat{k} = 1.025$
and $\hat{m} = 1.146$.








Number      Observed    Poisson         Negative Binomial
per Leaf    Count       Distribution    Distribution

0           70          47.7            69.5
1           38          54.6            37.6
2           17          31.3            20.1
3           10          12.0            10.7
4            9           3.4             5.7
5            3            .75            3.0
6            2            .15            1.6
7            1            .03             .85
8+           0            .00             .95
Casual inspection of this table makes it clear that the Poisson does not fit; there
are many more small and large counts observed than are expected for a Poisson
distribution. ■


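A recursive relation is useful in fitting the negative binomial distribution. Writing
$p_x = f(x \mid m, k)$, the frequency function given earlier yields

$$p_0 = \left(1 + \frac{m}{k}\right)^{-k}, \qquad p_x = \left(\frac{k + x - 1}{x}\right) \left(\frac{m}{k + m}\right) p_{x-1}, \quad x = 1, 2, \ldots$$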
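
The recursion $p_x = \frac{k + x - 1}{x}\,\frac{m}{k + m}\,p_{x-1}$, started from $p_0 = (1 + m/k)^{-k}$, follows from the ratio of successive values of the frequency function and avoids evaluating gamma functions. The sketch below (function name hypothetical) applies it with the estimates quoted for the mite counts, $\hat{m} = 1.146$ and $\hat{k} = 1.025$, and the 150 leaves of that example.

```python
def negative_binomial_probs(m, k, x_max):
    """p_0, ..., p_{x_max} computed by the recursion on successive probabilities."""
    probs = [(1 + m / k) ** (-k)]
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * (k + x - 1) / x * m / (k + m))
    return probs


# Expected counts for 150 leaves (6 trees, 25 leaves each).
probs = negative_binomial_probs(1.146, 1.025, 7)
expected = [150 * p for p in probs]
expected.append(150 * (1 - sum(probs)))  # the "8+" cell
print([round(e, 2) for e in expected])
```

The resulting values can be compared with the negative binomial column of the table in the insect counts example.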





8.8 Sufficiency


This section introduces the concept of sufficiency and some of its theoretical
implications. Suppose that $X_1, \ldots, X_n$ is a sample from a probability distribution with the
density or frequency function $f(x \mid \theta)$. The concept of sufficiency arises as an attempt
to answer the following question: Is there a statistic, a function $T(X_1, \ldots, X_n)$, that
contains all the information in the sample about $\theta$? If so, a reduction of the original
data to this statistic without loss of information is possible. For example, consider
a sequence of independent Bernoulli trials with unknown probability of success, $\theta$.
We may have the intuitive feeling that the total number of successes contains all
the information about $\theta$ that there is in the sample, that the order in which the
successes occurred, for example, does not give any additional information. The following
definition formalizes this idea.


DEFINITION 

A statistic $T(X_1, \ldots, X_n)$ is said to be sufficient for $\theta$ if the conditional
distribution of $X_1, \ldots, X_n$, given $T = t$, does not depend on $\theta$ for any value
of $t$. ■


In other words, given the value of $T$, which is called a sufficient statistic, we can
gain no more knowledge about $\theta$ from knowing more about the probability distribution
of $X_1, \ldots, X_n$. (Formally, we could envision keeping only $T$ and throwing away all
the $X_i$ without any loss of information. Informally, and more realistically, this would
make no sense at all. The values of the $X_i$ might indicate that the model did not fit or
that something was fishy about the data. What would you think, for example, if you
saw 50 ones followed by 50 zeros in a sequence of supposedly independent Bernoulli
trials?)
EXAMPLE A Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with
$P(X_i = 1) = \theta$. We will verify that $T = \sum_{i=1}^{n} X_i$ is sufficient for $\theta$.

$$P(X_1 = x_1, \ldots, X_n = x_n \mid T = t) = \frac{P(X_1 = x_1, \ldots, X_n = x_n, T = t)}{P(T = t)}$$

Bearing in mind that the $X_i$ can take on only the values 0 and 1, the probability
in the numerator is the probability that some particular set of $t$ of the $X_i$ are equal to 1
and the other $n - t$ are 0. Since the $X_i$ are independent, the probability of this is
the product of the marginal probabilities, or $\theta^t (1 - \theta)^{n-t}$. To find the denominator,
note that the distribution of $T$, the total number of ones, is binomial with $n$ trials and
probability of success $\theta$. The ratio in question is thus

$$\frac{\theta^t (1 - \theta)^{n-t}}{\dbinom{n}{t} \theta^t (1 - \theta)^{n-t}} = \frac{1}{\dbinom{n}{t}}$$

The conditional distribution thus does not involve $\theta$ at all. Given the total number of
ones, the probability that they occur on any particular set of $t$ trials is the same for
any value of $\theta$, so that set of trials contains no additional information about $\theta$. ■


A Factorization Theorem 


The preceding definition of sufficiency is hard to work with, because it does not 
indicate how to go about finding a sufficient statistic, and given a candidate statistic, 
T, it would typically be very hard to conclude whether it was sufficient because of 
the difficulty in evaluating the conditional distribution. The following factorization 
theorem provides a convenient means of identifying sufficient statistics. 


THEOREM A


A necessary and sufficient condition for $T(X_1, \ldots, X_n)$ to be sufficient for a
parameter $\theta$ is that the joint probability function (density function or frequency
function) factors in the form

$$f(x_1, \ldots, x_n \mid \theta) = g[T(x_1, \ldots, x_n), \theta]\, h(x_1, \ldots, x_n)$$
Proof

We give a proof for the discrete case. (The proof for the general case is more
subtle and requires regularity conditions, but the basic ideas are the same.) First,
suppose that the frequency function factors as given in the theorem. To simplify
notation, we will let $\mathbf{X}$ denote $(X_1, \ldots, X_n)$ and $\mathbf{x}$ denote $(x_1, \ldots, x_n)$. We have

$$P(T = t) = \sum_{T(\mathbf{x}) = t} P(\mathbf{X} = \mathbf{x}) = g(t, \theta) \sum_{T(\mathbf{x}) = t} h(\mathbf{x})$$

Here the notation indicates that the sum is over all $\mathbf{x}$ such that $T(\mathbf{x}) = t$. We then
have

$$P(\mathbf{X} = \mathbf{x} \mid T = t) = \frac{P(\mathbf{X} = \mathbf{x}, T = t)}{P(T = t)} = \frac{h(\mathbf{x})}{\sum_{T(\mathbf{x}) = t} h(\mathbf{x})}$$

This conditional distribution does not depend on $\theta$, as was to be shown.
To show that the conclusion holds in the other direction, suppose that the
conditional distribution of $\mathbf{X}$ given $T$ is independent of $\theta$. Let

$$g(t, \theta) = P(T = t \mid \theta)$$
$$h(\mathbf{x}) = P(\mathbf{X} = \mathbf{x} \mid T = t)$$

We then have

$$P(\mathbf{X} = \mathbf{x} \mid \theta) = P(T = t \mid \theta)\, P(\mathbf{X} = \mathbf{x} \mid T = t) = g(t, \theta)\, h(\mathbf{x})$$

as was to be shown. ■


We can demonstrate the utility of Theorem A by applying it to some examples. 
More examples are included in the problems at the end of this chapter. 


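EXAMPLE A  Consider a sequence of independent Bernoulli random variables, $X_1, \ldots, X_n$, where

$$P(X_i = x) = \theta^x (1 - \theta)^{1-x}, \qquad x = 0 \text{ or } x = 1$$

then

$$f(\mathbf{x} \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i}$$

$$= \theta^{\sum_{i=1}^{n} x_i} (1 - \theta)^{n - \sum_{i=1}^{n} x_i}$$

$$= \left(\frac{\theta}{1 - \theta}\right)^{\sum_{i=1}^{n} x_i} (1 - \theta)^n$$

We see that $f(\mathbf{x} \mid \theta)$ depends on $x_1, x_2, \ldots, x_n$ only through the sufficient statistic
$t = \sum_{i=1}^{n} x_i$, and $f(\mathbf{x} \mid \theta)$ is of the form $g(\sum_{i=1}^{n} x_i, \theta)\, h(\mathbf{x})$, where $h(\mathbf{x}) = 1$ and

$$g(t, \theta) = \left(\frac{\theta}{1 - \theta}\right)^{t} (1 - \theta)^n$$ ■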


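EXAMPLE B  Consider a random sample from a normal distribution that has an unknown mean and
variance. We have

$$f(\mathbf{x} \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right]$$

$$= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right]$$

$$= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \left(\sum_{i=1}^{n} x_i^2 - 2\mu \sum_{i=1}^{n} x_i + n\mu^2\right)\right]$$

This expression is just a function of $\sum_{i=1}^{n} x_i$ and $\sum_{i=1}^{n} x_i^2$, which are therefore
sufficient statistics. In this example we have a two-dimensional sufficient statistic.
Although Theorem A was stated explicitly for a one-dimensional sufficient statistic,
the multidimensional analogue holds also. ■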
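
As a numerical illustration of the normal example, the sketch below compares the joint density of two different samples that share the same values of $\sum x_i$ and $\sum x_i^2$. The samples (0, 3, 3) and (1, 1, 4), the parameter values, and the function names are illustrative assumptions chosen so that both sums agree (6 and 18, respectively); they do not come from the text.

```python
from math import exp, pi, sqrt, isclose

def normal_likelihood(xs, mu, sigma):
    """Joint N(mu, sigma^2) density of an i.i.d. sample, computed term by term."""
    value = 1.0
    for x in xs:
        value *= exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))
    return value

def likelihood_from_statistics(n, s1, s2, mu, sigma):
    """The same density written using only s1 = sum x_i and s2 = sum x_i^2."""
    quad = s2 - 2 * mu * s1 + n * mu ** 2
    return exp(-quad / (2 * sigma ** 2)) / (sigma ** n * (2 * pi) ** (n / 2))

sample_a = [0.0, 3.0, 3.0]   # sum = 6, sum of squares = 18
sample_b = [1.0, 1.0, 4.0]   # sum = 6, sum of squares = 18

results = []
for mu, sigma in [(0.0, 1.0), (2.0, 1.5), (-1.0, 3.0)]:
    la = normal_likelihood(sample_a, mu, sigma)
    lb = normal_likelihood(sample_b, mu, sigma)
    ls = likelihood_from_statistics(3, 6.0, 18.0, mu, sigma)
    results.append((isclose(la, lb), isclose(la, ls)))
print(results)
```

For every $(\mu, \sigma)$ tried, the two samples give the same likelihood, and that value can be computed from the two sums alone.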





Because the likelihood factors as

$$f(x_1, \ldots, x_n \mid \theta) = g[T(x_1, \ldots, x_n), \theta]\, h(x_1, \ldots, x_n)$$

its dependence on $\theta$ involves the data only through $T(x_1, \ldots, x_n)$. The maximum
likelihood estimate is found by maximizing $g[T(x_1, \ldots, x_n), \theta]$. In Example A, the
likelihood is a function of $t = \sum_{i=1}^{n} x_i$, and the maximum likelihood estimate is
$\hat\theta = t/n$.

Similarly, in a Bayesian framework, the posterior distribution of $\theta$ is proportional
to the product of the prior distribution of $\theta$ and the likelihood. As a function of $\theta$, the
posterior distribution thus depends on the data only through $g[T(x_1, \ldots, x_n), \theta]$; the
posterior probability of $\theta$ is the same for all $(x_1, \ldots, x_n)$ that have a common value
of $T(x_1, \ldots, x_n)$. The sufficient statistic carries all the information about $\theta$ that is
contained in the data $x_1, x_2, \ldots, x_n$.
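
The Bayesian statement can be illustrated with a small grid computation. This is a sketch under stated assumptions: Bernoulli data, a flat prior on $\theta$, a grid of 99 values, and two made-up sequences of length 10 that each contain four ones; `grid_posterior` is a hypothetical helper name.

```python
def grid_posterior(xs, grid):
    """Posterior of theta on a grid for Bernoulli data under a flat prior."""
    weights = []
    for theta in grid:
        like = 1.0
        for x in xs:
            like *= theta if x == 1 else (1 - theta)
        weights.append(like)  # flat prior: posterior is proportional to the likelihood
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 100 for i in range(1, 100)]

# Two different orderings with the same total t = 4 out of n = 10.
post_a = grid_posterior([1, 1, 0, 0, 0, 1, 0, 0, 1, 0], grid)
post_b = grid_posterior([0, 0, 0, 1, 0, 0, 1, 1, 0, 1], grid)

max_gap = max(abs(a - b) for a, b in zip(post_a, post_b))
mode = grid[post_a.index(max(post_a))]
print(max_gap, mode)
```

The two posteriors coincide up to rounding error, and with a flat prior the posterior mode falls at the maximum likelihood estimate $t/n$.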

A study of the properties of probability distributions that have sufficient statistics
of the same dimension as the parameter space regardless of sample size led to the
development of what is called the exponential family of probability distributions.
Many common distributions, including the normal, the binomial, the Poisson, and
the gamma, are members of this family. One-parameter members of the exponential
family have density or frequency functions of the form

$$f(x \mid \theta) = \exp[c(\theta) T(x) + d(\theta) + S(x)], \qquad x \in A$$
$$= 0, \qquad x \notin A$$

where the set $A$ does not depend on $\theta$. Suppose that $X_1, \ldots, X_n$ is a sample from a
member of the exponential family; the joint probability function is

$$f(\mathbf{x} \mid \theta) = \prod_{i=1}^{n} \exp[c(\theta) T(x_i) + d(\theta) + S(x_i)]$$

$$= \exp\left[c(\theta) \sum_{i=1}^{n} T(x_i) + n\, d(\theta)\right] \exp\left[\sum_{i=1}^{n} S(x_i)\right]$$

From this result, it is apparent by the factorization theorem that $\sum_{i=1}^{n} T(X_i)$ is a
sufficient statistic.


EXAMPLE C  The frequency function of the Bernoulli distribution is

$$P(X = x) = \theta^x (1 - \theta)^{1-x}, \qquad x = 0 \text{ or } x = 1$$

$$= \exp\left[x \log\left(\frac{\theta}{1 - \theta}\right) + \log(1 - \theta)\right]$$

This is a member of the exponential family with $T(x) = x$, and we have already seen
that $\sum_{i=1}^{n} X_i$ is a sufficient statistic for a sample from the Bernoulli distribution. ■
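
The rewriting of the Bernoulli frequency function in exponential-family form can be verified directly. The sketch below is illustrative only: it takes $c(\theta) = \log[\theta/(1-\theta)]$, $d(\theta) = \log(1-\theta)$, $T(x) = x$, and $S(x) = 0$, as read off from the display above, and the function names and the values of $\theta$ are assumptions.

```python
from math import exp, log, isclose

def bernoulli_pmf(x, theta):
    """Frequency function theta^x (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)

def bernoulli_expfam(x, theta):
    """The same frequency function in the form exp[c(theta) T(x) + d(theta) + S(x)]."""
    c = log(theta / (1 - theta))  # c(theta)
    d = log(1 - theta)            # d(theta)
    return exp(c * x + d)         # T(x) = x and S(x) = 0

checks = [
    isclose(bernoulli_pmf(x, theta), bernoulli_expfam(x, theta))
    for theta in (0.1, 0.5, 0.9)
    for x in (0, 1)
]
print(all(checks))
```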


A $k$-parameter member of the exponential family has a density or frequency
function of the form

$$f(x \mid \theta) = \exp\left[\sum_{i=1}^{k} c_i(\theta) T_i(x) + d(\theta) + S(x)\right], \qquad x \in A$$
$$= 0, \qquad x \notin A$$

where the set $A$ does not depend on $\theta$.

The normal distribution is of this form. A great deal of theoretical work has 
centered around the exponential family; further discussion of this family can be found 
in Bickel and Doksum (2001). 

We conclude this section with the following corollary of Theorem A. 


COROLLARY A 
If $T$ is sufficient for $\theta$, the maximum likelihood estimate is a function of $T$.


Proof


From Theorem A, the likelihood is $g(T, \theta)\, h(\mathbf{x})$, in which $\theta$ appears only in the
factor $g(T, \theta)$. To maximize this quantity over $\theta$, we need only maximize $g(T, \theta)$. ■
Corollary A and the Rao-Blackwell theorem of the next section may be interpreted
as giving some theoretical support to the use of maximum likelihood estimates.


8.8.2 The Rao-Blackwell Theorem


In the preceding section, we argued for the importance of sufficient statistics on
essentially qualitative grounds. The Rao-Blackwell theorem gives a quantitative rationale
for basing an estimator of a parameter $\theta$ on a sufficient statistic if one exists.


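THEOREM A  Rao-Blackwell Theorem


Let $\hat\theta$ be an estimator of $\theta$ with $E(\hat\theta^2) < \infty$ for all $\theta$. Suppose that $T$ is sufficient
for $\theta$, and let $\tilde\theta = E(\hat\theta \mid T)$. Then, for all $\theta$,

$$E(\tilde\theta - \theta)^2 \le E(\hat\theta - \theta)^2$$

The inequality is strict unless $\hat\theta = \tilde\theta$.


Proof


We first note that, from the property of iterated conditional expectation
(Theorem A of Section 4.4.1),

$$E(\tilde\theta) = E[E(\hat\theta \mid T)] = E(\hat\theta)$$

Therefore, to compare the mean squared error of the two estimators, we need
only compare their variances. From Theorem B of Section 4.4.1, we have

$$\mathrm{Var}(\hat\theta) = \mathrm{Var}[E(\hat\theta \mid T)] + E[\mathrm{Var}(\hat\theta \mid T)]$$

or

$$\mathrm{Var}(\hat\theta) = \mathrm{Var}(\tilde\theta) + E[\mathrm{Var}(\hat\theta \mid T)]$$

Thus, $\mathrm{Var}(\hat\theta) > \mathrm{Var}(\tilde\theta)$ unless $\mathrm{Var}(\hat\theta \mid T) = 0$, which is the case only if $\hat\theta$ is a
function of $T$, which would imply $\hat\theta = \tilde\theta$. ■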


Since $E(\hat\theta \mid T)$ is a function of the sufficient statistic $T$, the Rao-Blackwell theorem
gives a strong rationale for basing estimators on sufficient statistics if they exist. If an
estimator is not a function of a sufficient statistic, it can be improved.
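
A small simulation can make the improvement concrete. The sketch below assumes independent Bernoulli trials and starts from the deliberately crude unbiased estimator $\hat\theta = X_1$; by symmetry, $E(X_1 \mid T = t) = t/n$ when $T = \sum X_i$, so conditioning on the sufficient statistic gives the sample proportion. The function name, the seed, and the values of $\theta$, $n$, and the number of replications are illustrative assumptions.

```python
import random

def simulate_mse(theta, n, reps, seed=0):
    """Monte Carlo mean squared errors of X_1 and of E(X_1 | T) = T / n."""
    rng = random.Random(seed)
    se_crude = 0.0
    se_rb = 0.0
    for _ in range(reps):
        xs = [1 if rng.random() < theta else 0 for _ in range(n)]
        crude = xs[0]        # uses only the first observation
        rb = sum(xs) / n     # conditional expectation given T = sum(xs)
        se_crude += (crude - theta) ** 2
        se_rb += (rb - theta) ** 2
    return se_crude / reps, se_rb / reps

mse_crude, mse_rb = simulate_mse(theta=0.3, n=10, reps=20000)
print(mse_crude, mse_rb)
```

Both estimators are unbiased, so their mean squared errors are their variances, $\theta(1-\theta)$ and $\theta(1-\theta)/n$; the simulated values should be close to these and show the reduction predicted by the theorem.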

Suppose that there are two estimates, $\hat\theta_1$ and $\hat\theta_2$, having the same expectation.
Assuming that a sufficient statistic $T$ exists, we may construct two other estimates, $\tilde\theta_1$
and $\tilde\theta_2$, by conditioning on $T$. The theory we have developed so far gives no clues as
to which one of these two is better. If the probability distribution of $T$ has the property
called completeness, $\tilde\theta_1$ and $\tilde\theta_2$ are identical, by a theorem of Lehmann and Scheffé.
We will not define completeness or pursue this topic further; Lehmann and Casella
(1998) and Bickel and Doksum (2001) discuss this concept.


8.9 Concluding Remarks


Certain key ideas first introduced in the context of survey sampling in Chapter 7 have 
recurred in this chapter. We have viewed an estimate as a random variable having a 
probability distribution called its sampling distribution. In Chapter 7, the estimate was 
of a parameter, such as the mean, of a finite population; in this chapter, the estimate 
was of a parameter of a probability distribution. In both cases, characteristics of 
the sampling distribution, such as the bias and the variance and the large sample 
approximate form, have been of interest. In both chapters, we studied confidence 
intervals for the true value of the unknown parameter. The method of propagation of 
error, or linearization, has been a useful tool in both chapters. These key ideas will 
be important in other contexts in later chapters as well. 

Important concepts and techniques in estimation theory were introduced in this 
chapter. We discussed two general methods of estimation—the method of moments 
and the method of maximum likelihood. The latter especially has great general utility 
in statistics. We developed and applied some approximate distribution theory for 
maximum likelihood estimates. Other theoretical developments included the concept 
of efficiency, the Cramér-Rao lower bound, and the concept of sufficiency and some 
of its consequences. 

Bayesian inference was introduced in this chapter. The point of view contrasts 
rather sharply with that of frequentist inference in that the Bayesian formalism allows 
uncertainty statements about parameter values to be probabilistic, for example, “After 
seeing the data, the probability is 95% that $1.8 < \theta < 6.3$.” In frequentist inference,
$\theta$ is not a random variable, and a statement like this would literally make no sense; it
would be replaced by, “A 95% confidence interval for $\theta$ is [1.8, 6.3],” perhaps followed
by a long convoluted explication of the meaning of a confidence interval. Despite this 
apparently sharp philosophical difference, Bayesian and frequentist procedures have a 
great deal in common and typically lead to similar conclusions. Despite the distinction 
between the two statements above, the statements may well mean essentially the 
same thing operationally to a practitioner who has analyzed the data. The likelihood 
function is fundamental for both frequentist and Bayesian inference. In an application, 
the choice of a model, that is, the choice of a likelihood function, will typically 
be much more important than whether one subsequently multiplies it by a prior or
just maximizes it. This is especially true if flat priors are used; in fact, one might 
regard a flat prior as a device that allows the likelihood to be treated as a probability 
density. 

In this chapter, we introduced the bootstrap method for assessing the variability 
of an estimate. Such uses of simulation have become increasingly widespread as 
computers have become faster and cheaper; the bootstrap as a general method has 
been developed only quite recently and has rapidly become one of the most important 
statistical tools. We will see other situations in which the bootstrap is useful in later 
chapters. Efron and Tibshirani (1993) give an excellent introduction to the theory and 
applications of the bootstrap. 

The context in which we have introduced the bootstrap is often referred to as the 
parametric bootstrap. The nonparametric bootstrap will be introduced in Chapter 10. 
The parametric bootstrap can be thought about somewhat abstractly in the following 
way. We have data $\mathbf{x}$ that we regard as being generated by a probability distribution
$F(\mathbf{x} \mid \theta)$, which depends on a parameter $\theta$. We wish to know $E\,h(\mathbf{X}, \theta)$ for some
function $h(\,)$. For example, if $\theta$ itself is estimated from the data as $\hat\theta(\mathbf{x})$ and
$h(\mathbf{X}, \theta) = [\hat\theta(\mathbf{X}) - \theta]^2$, then $E\,h(\mathbf{X}, \theta)$ is the mean square error of the estimate.
As another example, if

$$h(\mathbf{X}, \theta) = \begin{cases} 1 & \text{if } |\hat\theta(\mathbf{X}) - \theta| > \Delta \\ 0 & \text{otherwise} \end{cases}$$

then $E\,h(\mathbf{X}, \theta)$ is the probability that $|\hat\theta(\mathbf{X}) - \theta| > \Delta$. We realize that if $\theta$ were known,
we could use the computer to generate independent random variables $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_B$
from $F(\mathbf{x} \mid \theta)$ and then appeal to the law of large numbers:

$$E\,h(\mathbf{X}, \theta) \approx \frac{1}{B} \sum_{i=1}^{B} h(\mathbf{X}_i, \theta)$$

This approximation could be made arbitrarily precise by choosing $B$ sufficiently large.
The parametric bootstrap principle is to perform this Monte Carlo simulation using $\hat\theta$
in place of the unknown $\theta$; that is, using $F(\mathbf{x} \mid \hat\theta)$ to generate the $\mathbf{X}_i$. It is difficult to
give a concise answer to the natural question: How much error is introduced by using
$\hat\theta$ in place of $\theta$? The answer depends on the continuity of $E\,h(\mathbf{X}, \theta)$ as a function of
$\theta$. If small changes in $\theta$ can give rise to large changes in $E\,h(\mathbf{X}, \theta)$, the parametric
bootstrap will not work well.
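
The recipe just described can be sketched in a few lines. The example below is an illustration under stated assumptions, not a method taken from the text: the model is exponential with rate $\theta$ (chosen because the standard library can simulate it), the estimate is the mle $\hat\theta = 1/\bar{x}$, the ten data values are made up, and the names `mle_rate` and `parametric_bootstrap` are hypothetical. It approximates both choices of $h$ discussed above, the mean square error and the probability of an error larger than $\Delta$.

```python
import random
from statistics import mean

def mle_rate(xs):
    """MLE of the rate of an exponential distribution: 1 / sample mean."""
    return 1.0 / mean(xs)

def parametric_bootstrap(xs, h, B=2000, seed=1):
    """Approximate E h(X, theta) by simulating B samples from F(x | theta_hat)."""
    rng = random.Random(seed)
    theta_hat = mle_rate(xs)
    n = len(xs)
    total = 0.0
    for _ in range(B):
        # Draw a sample of the original size from the fitted model.
        x_star = [rng.expovariate(theta_hat) for _ in range(n)]
        total += h(x_star, theta_hat)
    return total / B

data = [0.8, 2.1, 0.3, 1.7, 0.9, 1.2, 0.4, 2.6, 1.1, 0.6]  # made-up observations
delta = 0.25                                                # illustrative threshold

# h(X, theta) = [theta_hat(X) - theta]^2 gives the mean square error.
mse = parametric_bootstrap(data, lambda x, th: (mle_rate(x) - th) ** 2)

# h(X, theta) = 1 if |theta_hat(X) - theta| > delta, else 0, gives a tail probability.
tail = parametric_bootstrap(
    data, lambda x, th: 1.0 if abs(mle_rate(x) - th) > delta else 0.0
)
print(mle_rate(data), mse, tail)
```

Increasing `B` reduces only the Monte Carlo error; the error from using $\hat\theta$ in place of $\theta$ remains, as discussed above.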


8.10 Problems 


1. The following table gives the observed counts in 1-second intervals for
Berkson's data (Section 8.2). What are the expected counts from a Poisson
distribution? Do they match the observed counts?


n     Observed
0     5267
1     4436
2     1800
3     534
4     111
5+    21





2. The Poisson distribution has been used by traffic engineers as a model for light 
traffic, based on the rationale that if the rate is approximately constant and the 
traffic is light (so the individual cars move independently of each other), the 
distribution of counts of cars in a given time interval or space area should be nearly 
Poisson (Gerlough and Schuhl 1955). The following table shows the number of 
right turns during 300 3-min intervals at a specific intersection. Fit a Poisson 
distribution. Comment on the fit by comparing observed and expected counts. It 
is useful to know that the 300 intervals were distributed over various hours of the 
day and various days of the week. 










n      Frequency
0      14
1      30
2      36
3      68
4      43
5      43
6      30
7      14
8      10
9      6
10     4
11     1
12     1
13+    0





3. One of the earliest applications of the Poisson distribution was made by Student 
(1907) in studying errors made in counting yeast cells or blood corpuscles with 
a haemacytometer. In this study, yeast cells were killed and mixed with water 
and gelatin; the mixture was then spread on a glass and allowed to cool. Four 
different concentrations were used. Counts were made on 400 squares, and the 
data are summarized in the following table: 





Number      Concentration  Concentration  Concentration  Concentration
of Cells    1              2              3              4
0           213            103            75             0
1           128            143            103            20
2           37             98             121            43
3           18             42             54             53
4           3              8              30             86
5           1              4              13             70
6           0              2              2              54
7           0              0              1              37
8           0              0              0              18
9           0              0              1              10
10          0              0              0              5
11          0              0              0              2
12          0              0              0              2





a. Estimate the parameter $\lambda$ for each of the four sets of data.
b. Find an approximate 95% confidence interval for each estimate.
c. Compare observed and expected counts. 


4. Suppose that $X$ is a discrete random variable with

$$P(X = 0) = \frac{2}{3}\theta$$
$$P(X = 1) = \frac{1}{3}\theta$$
$$P(X = 2) = \frac{2}{3}(1 - \theta)$$
$$P(X = 3) = \frac{1}{3}(1 - \theta)$$

where $0 \le \theta \le 1$ is a parameter. The following 10 independent observations
were taken from such a distribution: (3, 0, 2, 1, 3, 2, 1, 0, 2, 1).

a. Find the method of moments estimate of $\theta$.

b. Find an approximate standard error for your estimate.

c. What is the maximum likelihood estimate of $\theta$?

d. What is an approximate standard error of the maximum likelihood estimate?

e. If the prior distribution of $\Theta$ is uniform on [0, 1], what is the posterior density?
Plot it. What is the mode of the posterior?


5. Suppose that $X$ is a discrete random variable with $P(X = 1) = \theta$ and
$P(X = 2) = 1 - \theta$. Three independent observations of $X$ are made: $x_1 = 1$, $x_2 = 2$, $x_3 = 2$.

a. Find the method of moments estimate of $\theta$.

b. What is the likelihood function?

c. What is the maximum likelihood estimate of $\theta$?

d. If $\Theta$ has a prior distribution that is uniform on [0, 1], what is its posterior
density?


6. Suppose that X ~ bin(n, p). 
a. Show that the mle of $p$ is $\hat{p} = X/n$.
b. Show that the mle of part (a) attains the Cramér-Rao lower bound.
c. If $n = 10$ and $X = 5$, plot the log likelihood function.


7. Suppose that X follows a geometric distribution, 
$$P(X = k) = p(1 - p)^{k-1}$$


and assume an i.i.d. sample of size n. 


a. Find the method of moments estimate of p. 

b. Find the mle of p. 

c. Find the asymptotic variance of the mle. 

d. Let p have a uniform prior distribution on [0, 1]. What is the posterior distri- 
bution of p? What is the posterior mean? 


8. In an ecological study of the feeding behavior of birds, the number of hops
between flights was counted for several birds. For the following data, (a) fit a
geometric distribution, (b) find an approximate 95% confidence interval for $p$, (c)
examine goodness of fit. (d) If a uniform prior is used for $p$, what is the posterior
distribution and what are the posterior mean and standard deviation?


Number of Hops    Frequency
1                 48
2                 31
3                 20
4                 9
5                 6
6                 5
7                 4
8                 2
9                 1
10                1
11                2
12                1


9. How would you respond to the following argument? This talk of sampling
distributions is ridiculous! Consider Example A of Section 8.4. The experimenter
found the mean number of fibers to be 24.9. How can this be a “random variable”
with an associated “probability distribution” when it's just a number? The author
of this book is guilty of deliberate mystification!


10. Use the normal approximation of the Poisson distribution to sketch the
approximate sampling distribution of $\hat\lambda$ of Example A of Section 8.4. According to this
approximation, what is $P(|\lambda_0 - \hat\lambda| > \delta)$ for $\delta = .5, 1, 1.5, 2,$ and $2.5$, where $\lambda_0$
denotes the true value of $\lambda$?


11. In Example A of Section 8.4, we used knowledge of the exact form of the sampling
distribution of $\hat\lambda$ to estimate its standard error by

$$s_{\hat\lambda} = \sqrt{\frac{\hat\lambda}{n}}$$

This was arrived at by realizing that $\sum X_i$ follows a Poisson distribution with
parameter $n\lambda_0$. Now suppose we hadn't realized this but had used the bootstrap,
letting the computer do our work for us by generating $B$ samples of size $n = 23$
of Poisson random variables with parameter $\lambda = 24.9$, forming the mle of $\lambda$ from
each sample, and then finally computing the standard deviation of the resulting
collection of estimates and taking this as an estimate of the standard error of $\hat\lambda$.
Argue that as $B \to \infty$, the standard error estimated in this way will tend to $s_{\hat\lambda}$.


12. Suppose that you had to choose either the method of moments estimates
or the maximum likelihood estimates in Example C of Section 8.4 and C of
Section 8.5. Which would you choose and why?


13. In Example D of Section 8.4, the method of moments estimate was found to be
$\hat\alpha = 3\bar{X}$. In this problem, you will consider the sampling distribution of $\hat\alpha$.


a. Show that $E(\hat\alpha) = \alpha$; that is, show that the estimate is unbiased.
b. Show that $\mathrm{Var}(\hat\alpha) = (3 - \alpha^2)/n$. [Hint: What is $\mathrm{Var}(\bar{X})$?]

c. Use the central limit theorem to deduce a normal approximation to the
sampling distribution of $\hat\alpha$. According to this approximation, if $n = 25$ and $\alpha = 0$,
what is $P(|\hat\alpha| > .5)$?


14. In Example C of Section 8.5, how could you use the bootstrap to estimate the
following measures of the accuracy of $\hat\alpha$: (a) $P(|\hat\alpha - \alpha_0| > .05)$, (b) $E(|\hat\alpha - \alpha_0|)$,
(c) that number $\Delta$ such that $P(|\hat\alpha - \alpha_0| > \Delta) = .5$.


15. The upper quartile of a distribution with cumulative distribution $F$ is that point
$q_{.25}$ such that $F(q_{.25}) = .75$. For a gamma distribution, the upper quartile depends
on $\alpha$ and $\lambda$, so denote it as $q(\alpha, \lambda)$. If a gamma distribution is fit to data as in
Example C of Section 8.5 and the parameters $\alpha$ and $\lambda$ are estimated by $\hat\alpha$ and $\hat\lambda$,
the upper quartile could then be estimated by $\hat{q} = q(\hat\alpha, \hat\lambda)$. Explain how to use
the bootstrap to estimate the standard error of $\hat{q}$.


16. Consider an i.i.d. sample of random variables with density function


a. Find the method of moments estimate of $\sigma$.
b. Find the maximum likelihood estimate of $\sigma$.
c. Find the asymptotic variance of the mle.

d. Find a sufficient statistic for $\sigma$.


17. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. random variables on the interval [0, 1]
with the density function

$$f(x \mid \alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} [x(1 - x)]^{\alpha - 1}$$

where $\alpha > 0$ is a parameter to be estimated from the sample. It can be shown
that

$$E(X) = \frac{1}{2}$$

$$\mathrm{Var}(X) = \frac{1}{4(2\alpha + 1)}$$

a. How does the shape of the density depend on $\alpha$?

b. How can the method of moments be used to estimate $\alpha$?

c. What equation does the mle of $\alpha$ satisfy?

d. What is the asymptotic variance of the mle?

e. Find a sufficient statistic for $\alpha$.


18. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. random variables on the interval [0, 1]
with the density function

$$f(x \mid \alpha) = \frac{\Gamma(3\alpha)}{\Gamma(\alpha)\Gamma(2\alpha)} x^{\alpha - 1} (1 - x)^{2\alpha - 1}$$

where $\alpha > 0$ is a parameter to be estimated from the sample. It can be shown


that

$$E(X) = \frac{1}{3}$$

$$\mathrm{Var}(X) = \frac{2}{9(3\alpha + 1)}$$

a. How could the method of moments be used to estimate $\alpha$?
b. What equation does the mle of $\alpha$ satisfy?

c. What is the asymptotic variance of the mle?

d. Find a sufficient statistic for $\alpha$.


19. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$.


a. If $\mu$ is known, what is the mle of $\sigma$?

b. If $\sigma$ is known, what is the mle of $\mu$?

c. In the case above ($\sigma$ known), does any other unbiased estimate of $\mu$ have
smaller variance?


20. Suppose that $X_1, X_2, \ldots, X_{25}$ are i.i.d. $N(\mu, \sigma^2)$, where $\mu = 0$ and $\sigma = 10$. Plot
the sampling distributions of $\bar{X}$ and $\hat\sigma^2$.


21. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. with density function

$$f(x \mid \theta) = e^{-(x - \theta)}, \qquad x \ge \theta$$

and $f(x \mid \theta) = 0$ otherwise.


a. Find the method of moments estimate of $\theta$.

b. Find the mle of $\theta$. (Hint: Be careful, and don't differentiate before thinking.
For what values of $\theta$ is the likelihood positive?)

c. Find a sufficient statistic for $\theta$.


22. The Weibull distribution was defined in Problem 67 of Chapter 2. This distribution
is sometimes fit to lifetimes. Describe how to fit this distribution to data and how
to find approximate standard errors of the parameter estimates.


23. A company has manufactured certain objects and has printed a serial number
on each manufactured object. The serial numbers start at 1 and end at $N$, where
$N$ is the number of objects that have been manufactured. One of these objects
is selected at random, and the serial number of that object is 888. What is the
method of moments estimate of $N$? What is the mle of $N$?


24. Find a very new shiny penny. Hold it on its edge and spin it. Do this 20 times
and count the number of times it comes to rest heads up. Letting $\pi$ denote the
probability of a head, graph the log likelihood of $\pi$. Next, repeat the experiment
in a slightly different way: This time spin the coin until 10 heads come up. Again,
graph the log likelihood of $\pi$.


25. If a thumbtack is tossed in the air, it can come to rest on the ground with either
the point up or the point touching the ground. Find a thumbtack. Before doing
any experiment, what do you think $\pi$, the probability of it landing point up, is?
Next, toss the thumbtack 20 times and graph the log likelihood of $\pi$. Then do
another experiment: Toss the thumbtack until it lands point up 5 times, and graph
the log likelihood of $\pi$ based on this experiment.

Find and graph the posterior distribution arising from a uniform prior on $\pi$.
Find the posterior mean and standard deviation and compare the posterior with
a normal distribution with that mean and standard deviation. Finally, toss the
thumbtack 20 more times and compare the posterior distribution based on all 40
tosses to that based on the first 20.


26. In an effort to determine the size of an animal population, 100 animals are captured
and tagged. Some time later, another 50 animals are captured, and it is found
that 20 of them are tagged. How would you estimate the population size? What
assumptions about the capture/recapture process do you need to make? (See
Example I of Section 1.4.2.)


27. Suppose that certain electronic components have lifetimes that are exponentially
distributed: $f(t \mid \tau) = (1/\tau) \exp(-t/\tau)$, $t \ge 0$. Five new components are put on
test, the first one fails at 100 days, and no further observations are recorded.

a. What is the likelihood function of $\tau$?

b. What is the mle of $\tau$?

c. What is the sampling distribution of the mle?

d. What is the standard error of the mle?


(Hint: See Example A of Section 3.7.)


28. Why do the intervals in the left panel of Figure 8.8 have different centers? Why
do they have different lengths?


29. Are the estimates of $\sigma^2$ at the centers of the confidence intervals shown in the
right panel of Figure 8.8? Why are some intervals so short and others so long? For
which of the samples that produced these confidence intervals was $\hat\sigma^2$ smallest?


30. The exponential distribution is $f(x; \lambda) = \lambda e^{-\lambda x}$ and $E(X) = \lambda^{-1}$. The cumulative
distribution function is $F(x) = P(X \le x) = 1 - e^{-\lambda x}$. Three observations
are made by an instrument that reports $x_1 = 5$ and $x_2 = 3$, but $x_3$ is too large for
the instrument to measure and it reports only that $x_3 > 10$. (The largest value the
instrument can measure is 10.0.)


a. What is the likelihood function?
b. What is the mle of $\lambda$?


31. George spins a coin three times and observes no heads. He then gives the coin to
Hilary. She spins it until the first head occurs, and ends up spinning it four times
total. Let $\theta$ denote the probability the coin comes up heads.


a. What is the likelihood of $\theta$?
b. What is the MLE of $\theta$?


32. The following 16 numbers came from a normal random number generator on a
computer:


5.3299 4.2537 3.1502 3.7032 1.6070 6.3923 3.1181
6.5941 3.5281 4.7433 0.1077 1.5977 5.4920 1.7220
4.1547 2.27799


a. What would you guess the mean and variance ($\mu$ and $\sigma^2$) of the generating
normal distribution were?

b. Give 90%, 95%, and 99% confidence intervals for $\mu$ and $\sigma^2$.

c. Give 90%, 95%, and 99% confidence intervals for $\sigma$.

d. How much larger a sample do you think you would need to halve the length
of the confidence interval for $\mu$?


33. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are unknown.
How should the constant $c$ be chosen so that the interval $(-\infty, \bar{X} + c)$ is a 95%
confidence interval for $\mu$; that is, $c$ should be chosen so that
$P(-\infty < \mu < \bar{X} + c) = .95$.


34. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu_0, \sigma_0^2)$ and $\mu$ and $\sigma^2$ are estimated by
the method of maximum likelihood, with resulting estimates $\hat\mu$ and $\hat\sigma^2$. Suppose
the bootstrap is used to estimate the sampling distribution of $\hat\mu$.

a. Explain why the bootstrap estimate of the distribution of $\hat\mu$ is $N(\hat\mu, \hat\sigma^2/n)$.

b. Explain why the bootstrap estimate of the distribution of $\hat\mu - \mu_0$ is $N(0, \hat\sigma^2/n)$.

c. According to the result of the previous part, what is the form of the bootstrap
confidence interval for $\mu$, and how does it compare to the exact confidence
interval based on the $t$ distribution?


35. (Bootstrap in Example A of Section 8.5.1) Let $U_1, U_2, \ldots, U_{1029}$ be independent
uniformly distributed random variables. Let $X_1$ equal the number of $U_i$ less than
.331, $X_2$ equal the number between .331 and .820, and $X_3$ equal the number
greater than .820. Why is the joint distribution of $X_1$, $X_2$, and $X_3$ multinomial
with probabilities .331, .489, and .180 and $n = 1029$?


36. How do the approximate 90% confidence intervals in Example E of Section 8.5.3
compare to those that would be obtained approximating the sampling distributions
of $\hat\alpha$ and $\hat\lambda$ by normal distributions with standard deviations given by $s_{\hat\alpha}$ and $s_{\hat\lambda}$
as in Example C of Section 8.5?


37. Using the notation of Section 8.5.3, suppose that $\underline{\theta}$ and $\overline{\theta}$ are lower and upper
quantiles of the distribution of $\theta^*$. Show that the bootstrap confidence interval
for $\theta$ can be written as $(2\hat\theta - \overline{\theta},\ 2\hat\theta - \underline{\theta})$.


38. Continuing Problem 37, show that if the sampling distribution of $\theta^*$ is symmetric
about $\hat\theta$, then the bootstrap confidence interval is $(\underline{\theta}, \overline{\theta})$.


39. In Section 8.5.3, the bootstrap confidence interval was derived from consideration
of the sampling distribution of $\hat\theta - \theta_0$. Suppose that we had started with considering
the distribution of $\hat\theta/\theta_0$. How would the argument have proceeded, and would the
bootstrap interval that was finally arrived at have been different?


40. In Example A of Section 8.5.1, how could you use the bootstrap to estimate the
following measures of the accuracy of $\hat\theta$: (a) $P(|\hat\theta - \theta_0| > .01)$, (b) $E(|\hat\theta - \theta_0|)$,
(c) that number $\Delta$ such that $P(|\hat\theta - \theta_0| > \Delta) = .5$?


41. What are the relative efficiencies of the method of moments and maximum
likelihood estimates of $\alpha$ and $\lambda$ in Example C of Section 8.4 and Example C of
Section 8.5?
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42. 


43. 


44. 


45. 


The file gamma-ray contains a small quantity of data collected from the Comp- 
ton Gamma Ray Observatory, a satellite launched by NASA in 1991 
(http://cossc.gsfc.nasa.gov/). For each of 100 sequential time 
intervals of variable lengths (given in seconds), the number of gamma rays 
originating in a particular area of the sky was recorded. Assuming a model 
that the arrival times are a Poisson process with constant emission rate (A = 
events per second), estimate A. What is the estimated standard error? How might 
you informally check the assumption that the emission rate is constant? What is 
the posterior distribution of A if an improper gamma prior is used? 


The file gamma- arrivals contains another set of gamma-ray data, this one 
consisting of the times between arrivals (interarrival times) of 3,935 photons 
(units are seconds). 


a. Make a histogram of the interarrival times. Does it appear that a gamma 
distribution would be a plausible model? 

b. Fit the parameters by the method of moments and by maximum likelihood. 
How do the estimates compare? 

c. Plot the two fitted gamma densities on top of the histogram. Do the fits look 
reasonable? 

d. Forboth maximum likelihood and the method of moments, use the bootstrap to 
estimate the standard errors of the parameter estimates. How do the estimated 
standard errors of the two methods compare? 

e. For both maximum likelihood and the method of moments, use the bootstrap 
to form approximate confidence intervals for the parameters. How do the 
confidence intervals for the two methods compare? 

f. Is the interarrival time distribution consistent with a Poisson process model 
for the arrival times? 


The file body t emp contains normal body temperature readings (degrees Fahren- 
heit) and heart rates (beats per minute) of 65 males (coded by 1) and 65 females 
(coded by 2) from Shoemaker (1996). Assuming that the population distributions 
are normal (an assumption that will be investigated in a later chapter), estimate the 
means and standard deviations of the males and females. Form 95% confidence 
intervals for the means. Standard folklore is that the average body temperature is 
98.6 degrees Fahrenheit. Does this appear to be the case? 


A Random Walk Model for Chromatin. A human chromosome is a very large 
molecule, about 2 or 3 centimeters long, containing 100 million base pairs (Mbp). 
The cell nucleus, where the chromosome is contained, is in contrast only about a 
thousandth of a centimeter in diameter. The chromosome is packed in a series of 
coils, called chromatin, in association with special proteins (histones), forming 
a string of microscopic beads. It is a mixture of DNA and protein. In the G0/G1 
phase of the cell cycle, between mitosis and the onset of DNA replication, the 
mitotic chromosomes diffuse into the interphase nucleus. At this stage, a number 
of important processes related to chromosome function take place. For exam- 
ple, DNA is made accessible for transcription and is duplicated, and repairs are 
made of DNA strand breaks. By the time of the next mitosis, the chromosomes 
have been duplicated. The complexity of these and other processes raises many 
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questions about the large-scale spatial organization of chromosomes and how 
this organization relates to cell function. Fundamentally, it is puzzling how these 
processes can unfold in such a spatially restricted environment. 

At a scale of about $10^{-3}$ Mbp, the DNA forms a chromatin fiber about 30
nm in diameter; at a scale of about $10^{-1}$ Mbp the chromatin may form loops.
Very little is known about the spatial organization beyond this scale. Various 
models have been proposed, ranging from highly random to highly organized, 
including irregularly folded fibers, giant loops, radial loop structures, systematic 
organization to make the chromatin readily accessible to transcription and repli- 
cation machinery, and stochastic configurations based on random walk models 
for polymers. 

A series of experiments (Sachs et al., 1995; Yokota et al., 1995) were conducted
to learn more about spatial organization on larger scales. Pairs of small
DNA sequences (size about 40 kbp) at specified locations on human chromosome 4
were fluorescently labeled in a large number of cells. The distances
between the members of these pairs were then determined by fluorescence
microscopy. (The distances measured were actually two-dimensional distances
between the projections of the paired locations onto a plane.) The empirical
distribution of these distances provides information about the nature of large-scale
organization.

There has long been a tradition in chemistry of modeling the configurations 
of polymers by the theory of random walks. As a consequence of such a model, 
the two-dimensional distance should follow a Rayleigh distribution 


$$f(r \mid \theta) = \frac{r}{\theta^2} \exp\left(-\frac{r^2}{2\theta^2}\right)$$


Basically, the reason for this is as follows: The random walk model implies that
the joint distribution of the locations of the pair in $\mathbb{R}^3$ is multivariate Gaussian; by
properties of the multivariate Gaussian, it can be shown that the joint distribution of
the locations of the projections onto a plane is bivariate Gaussian. As in Example
A of Section 3.6.2 of the text, it can be shown that the distance between the points
follows a Rayleigh distribution.

In this exercise, you will fit the Rayleigh distribution to some of the experimental
results and examine the goodness of fit. The entire data set comprises 36
experiments in which the separation between the pairs of fluorescently tagged
locations ranged from 10 Mbp to 192 Mbp. In each such experimental condition,
about 100-200 measurements of two-dimensional distances were determined.
This exercise will be concerned just with the data from three experiments
(short, medium, and long separation). The measurements from these experiments
are contained in the files Chromatin/short, Chromatin/medium, and
Chromatin/long.


a. What is the maximum likelihood estimate of $\theta$ for a sample from a Rayleigh
distribution?

b. What is the method of moments estimate? 

c. What are the approximate variances of the mle and the method of moments 
estimate? 
d. For each of the three experiments, plot the likelihood functions and find the
mle's and their approximate variances.

e. Find the method of moments estimates and the approximate variances.

f. For each experiment, make a histogram (with unit area) of the measurements 
and plot the fitted densities on top. Do the fits look reasonable? Is there any 
appreciable difference between the maximum likelihood fits and the method 
of moments fits? 

g. Does there appear to be any relationship between your estimates and the 
genomic separation of the points? 

h. For one of the experiments, compare the asymptotic variances to the results
obtained from a parametric bootstrap. In order to do this, you will have to
generate random variables from a Rayleigh distribution with parameter $\theta$.

Show that if $X$ follows a Rayleigh distribution with $\theta = 1$, then $Y = \theta X$
follows a Rayleigh distribution with parameter $\theta$. Thus it is sufficient to figure
out how to generate random variables that are Rayleigh, $\theta = 1$. Show how
Proposition D of Section 2.3 of the text can be applied to accomplish this.

$B = 100$ bootstrap samples should suffice for this problem. Make a
histogram of the values of the $\theta^*$. Does the distribution appear roughly normal?
Do you think that the large sample theory can be reasonably applied here?
Compare the standard deviation calculated from the bootstrap to the standard
errors you found previously.

i. For one of the experiments, use the bootstrap to construct an approximate 95%
confidence interval for $\theta$ using $B = 1000$ bootstrap samples. Compare this
interval to that obtained using large sample theory.




The data of this exercise were gathered as part of a study to estimate the population 
size of the bowhead whale (Raftery and Zeh 1993). The statistical procedures 
for estimating the population size along with an assessment of the variability of 
the estimate were quite involved, and this problem deals with only one aspect 
of the problem—a study of the distribution of whale swimming speeds. Pairs 
of sightings and corresponding locations that could be reliably attributed to the 
same whale were collected, thus providing an estimate of velocity for each whale. 
The velocities, $v_1, v_2, \ldots, v_{210}$ (km/h), were converted into times $t_1, t_2, \ldots, t_{210}$
to swim 1 km: $t_i = 1/v_i$. The distribution of the $t_i$ was then fit by a gamma
distribution. The times are contained in the file whales.


a. Make a histogram of the 210 values of $t_i$. Does it appear that a gamma
distribution would be a plausible model to fit?

b. Fit the parameters of the gamma distribution by the method of moments. 

c. Fit the parameters of the gamma distribution by maximum likelihood. How 
do these values compare to those found before? 

d. Plot the two gamma densities on top of the histogram. Do the fits look rea- 
sonable? 

e. Estimate the sampling distributions and the standard errors of the parameters 
fit by the method of moments by using the bootstrap. 

f. Estimate the sampling distributions and the standard errors of the parameters 
fit by maximum likelihood by using the bootstrap. How do they compare to 
the results found previously? 


g. Find approximate confidence intervals for the parameters estimated by maximum
likelihood.


47. The Pareto distribution has been used in economics as a model for a density
function with a slowly decaying tail:

$$f(x \mid x_0, \theta) = \theta x_0^{\theta} x^{-\theta - 1}, \qquad x \ge x_0, \quad \theta > 1$$

Assume that $x_0 > 0$ is given and that $X_1, X_2, \ldots, X_n$ is an i.i.d. sample.

a. Find the method of moments estimate of $\theta$.

b. Find the mle of $\theta$.

c. Find the asymptotic variance of the mle.

d. Find a sufficient statistic for $\theta$.


48. Consider the following method of estimating $\lambda$ for a Poisson distribution.
Observe that

$$p_0 = P(X = 0) = e^{-\lambda}$$

Letting $Y$ denote the number of zeros from an i.i.d. sample of size $n$, $\lambda$ might be
estimated by

$$\tilde{\lambda} = -\log\left(\frac{Y}{n}\right)$$

Use the method of propagation of error to obtain approximate expressions for
the variance and the bias of this estimate. Compare the variance of this estimate
to the variance of the mle, computing relative efficiencies for various values of
$\lambda$. Note that $Y \sim \text{bin}(n, p_0)$.


49. For the example on muon decay in Section 8.4, suppose that instead of recording
$x = \cos\theta$, only whether the electron goes backward ($x < 0$) or forward ($x > 0$)
is recorded.

a. How could $\alpha$ be estimated from $n$ independent observations of this type?
(Hint: Use the binomial distribution.)

b. What is the variance of this estimate and its efficiency relative to the method
of moments estimate and the mle for $\alpha = 0, .1, .2, .3, .4, .5, .6, .7, .8, .9$?


50. Let $X_1, \ldots, X_n$ be an i.i.d. sample from a Rayleigh distribution with parameter
$\theta > 0$:

$$f(x \mid \theta) = \frac{x}{\theta^2} e^{-x^2/(2\theta^2)}, \qquad x \ge 0$$

(This is an alternative parametrization of that of Example A in Section 3.6.2.)

a. Find the method of moments estimate of $\theta$.

b. Find the mle of $\theta$.

c. Find the asymptotic variance of the mle.


51. The double exponential distribution is

$$f(x \mid \theta) = \frac{1}{2} e^{-|x - \theta|}, \qquad -\infty < x < \infty$$

For an i.i.d. sample of size $n = 2m + 1$, show that the mle of $\theta$ is the median
of the sample. (The observation such that half of the rest of the observations are
smaller and half are larger.) [Hint: The function $g(x) = |x|$ is not differentiable.
Draw a picture for a small value of $n$ to try to understand what is going on.]


52. Let $X_1, \ldots, X_n$ be i.i.d. random variables with the density function

$$f(x \mid \theta) = (\theta + 1) x^{\theta}, \qquad 0 \le x \le 1$$

a. Find the method of moments estimate of $\theta$.

b. Find the mle of $\theta$.

c. Find the asymptotic variance of the mle.

d. Find a sufficient statistic for $\theta$.


53. Let $X_1, \ldots, X_n$ be i.i.d. uniform on $[0, \theta]$.

a. Find the method of moments estimate of $\theta$ and its mean and variance.

b. Find the mle of $\theta$.

c. Find the probability density of the mle, and calculate its mean and variance.
Compare the variance, the bias, and the mean squared error to those of the
method of moments estimate.

d. Find a modification of the mle that renders it unbiased.


54. Suppose that an i.i.d. sample of size 15 from a normal distribution gives $\bar{X} = 10$
and $s^2 = 25$. Find 90% confidence intervals for $\mu$ and $\sigma^2$.


55. For two factors (starchy or sugary, and green base leaf or white base leaf) the
following counts for the progeny of self-fertilized heterozygotes were observed
(Fisher 1958):

Type             Count
Starchy green     1997
Starchy white      906
Sugary green       904
Sugary white        32

According to genetic theory, the cell probabilities are $.25(2 + \theta)$, $.25(1 - \theta)$,
$.25(1 - \theta)$, and $.25\theta$, where $\theta$ ($0 < \theta < 1$) is a parameter related to the linkage
of the factors.

a. Find the mle of $\theta$ and its asymptotic variance.

b. Form an approximate 95% confidence interval for $\theta$ based on part (a).

c. Use the bootstrap to find the approximate standard deviation of the mle and
compare to the result of part (a).

d. Use the bootstrap to find an approximate 95% confidence interval and compare
to part (b).


56. Referring to Problem 55, consider two other estimates of $\theta$. (1) The expected
number of counts in the first cell is $n(2 + \theta)/4$; if this expected number is
equated to the count $X_1$, the following estimate is obtained:

$$\tilde{\theta}_1 = \frac{4X_1}{n} - 2$$

(2) The same procedure done for the last cell yields

$$\tilde{\theta}_2 = \frac{4X_4}{n}$$

Compute these estimates. Using that $X_1$ and $X_4$ are binomial random variables,
show that these estimates are unbiased, and obtain expressions for their variances.
Evaluate the estimated standard errors and compare them to the estimated
standard error of the mle.


57. This problem is concerned with the estimation of the variance of a normal
distribution with unknown mean from a sample $X_1, \ldots, X_n$ of i.i.d. normal random
variables. In answering the following questions, use the fact that (from Theorem B
of Section 6.3)

$$\frac{(n - 1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$

and that the mean and variance of a chi-square random variable with $r$ df are
$r$ and $2r$, respectively.

a. Which of the following estimates is unbiased?

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

b. Which of the estimates given in part (a) has the smaller MSE?

c. For what value of $\rho$ does $\rho \sum_{i=1}^{n} (X_i - \bar{X})^2$ have the minimal MSE?


58. If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur
with probabilities $(1 - \theta)^2$, $2\theta(1 - \theta)$, and $\theta^2$, respectively. Plato et al. (1964)
published the following data on haptoglobin type in a sample of 190 people:

        Haptoglobin Type
Hp1-1        Hp1-2        Hp2-2
   10           68          112

a. Find the mle of $\theta$.

b. Find the asymptotic variance of the mle.

c. Find an approximate 99% confidence interval for $\theta$.

d. Use the bootstrap to find the approximate standard deviation of the mle and
compare to the result of part (b).

e. Use the bootstrap to find an approximate 99% confidence interval and compare
to part (c).


59. Suppose that in the population of twins, males (M) and females (F) are equally
likely to occur and that the probability that twins are identical is $\alpha$. If twins are
not identical, their genes are independent.

a. Show that

$$P(MM) = P(FF) = \frac{1 + \alpha}{4}, \qquad P(MF) = \frac{1 - \alpha}{2}$$

b. Suppose that $n$ twins are sampled. It is found that $n_1$ are MM, $n_2$ are FF, and
$n_3$ are MF, but it is not known which twins are identical. Find the mle of $\alpha$
and its variance.


60. Let $X_1, \ldots, X_n$ be an i.i.d. sample from an exponential distribution with the
density function

$$f(x \mid \tau) = \frac{1}{\tau} e^{-x/\tau}, \qquad 0 \le x < \infty$$

a. Find the mle of $\tau$.

b. What is the exact sampling distribution of the mle?

c. Use the central limit theorem to find a normal approximation to the sampling
distribution.

d. Show that the mle is unbiased, and find its exact variance. (Hint: The sum of
the $X_i$ follows a gamma distribution.)

e. Is there any other unbiased estimate with smaller variance?

f. Find the form of an approximate confidence interval for $\tau$.

g. Find the form of an exact confidence interval for $\tau$.


61. Laplace's rule of succession. Laplace claimed that when an event happens $n$ times
in a row and never fails to happen, the probability that the event will occur the
next time is $(n + 1)/(n + 2)$. Can you suggest a rationale for this claim?


62. Show that the gamma distribution is a conjugate prior for the exponential
distribution. Suppose that the waiting time in a queue is modeled as an exponential
random variable with unknown parameter $\lambda$, and that the average time to serve a
random sample of 20 customers is 5.1 minutes. A gamma distribution is used as
a prior. Consider two cases: (1) the mean of the gamma is 0.5 and the standard
deviation is 1, and (2) the mean is 10 and the standard deviation is 20. Plot the
two posterior distributions and compare them. Find the two posterior means and
compare them. Explain the differences.


63. Suppose that 100 items are sampled from a manufacturing process and 3 are found
to be defective. A beta prior is used for the unknown proportion $\theta$ of defective
items. Consider two cases: (1) $a = b = 1$, and (2) $a = 0.5$ and $b = 5$. Plot the
two posterior distributions and compare them. Find the two posterior means and
compare them. Explain the differences.


64. This is a continuation of the previous problem. Let $X = 0$ or 1 according to
whether an item is defective. For each choice of the prior, what is the marginal
distribution of $X$ before the sample is taken? What are the marginal distributions
after the sample is taken? (Hint: for the second question, use the posterior
distribution of $\theta$.)


65. Suppose that a random sample of size 20 is taken from a normal distribution
with unknown mean and known variance equal to 1, and the mean is found to
be $\bar{x} = 10$. A normal distribution was used as the prior for the mean, and it was
found that the posterior mean was 15 and the posterior standard deviation was
0.1. What were the mean and standard deviation of the prior?


66. Let the unknown probability that a basketball player makes a shot successfully
be $\theta$. Suppose your prior on $\theta$ is uniform on [0, 1] and that she then makes two
shots in a row. Assume that the outcomes of the two shots are independent.

a. What is the posterior density of $\theta$?

b. What would you estimate the probability that she makes a third shot to be?


67. Evans (1953) considered fitting the negative binomial distribution and other
distributions to a number of data sets that arose in ecological studies. Two of these
sets will be used in this problem. The first data set gives frequency counts of
Glaux maritima made in 500 contiguous 20-cm$^2$ quadrats. For the second data
set, a plot of potato plants 48 rows wide and 96 ft long was examined. The area
was split into 2304 sampling units consisting of 2-ft lengths of row and in each
unit the number of potato beetles was counted. Fit Poisson and negative binomial
distributions, and comment on the goodness of fit. For these data, the method of
moments should be fairly efficient.








Count    Glaux maritima    Potato Beetles
  0             1               190
  1            15               264
  2            27               304
  3            42               260
  4            77               294
  5            77               219
  6            89               183
  7            57               150
  8            48               104
  9            24                90
 10            14                60
 11            16                46
 12             9                29
 13             3                36
 14             1                19
 15                              12
 16                              11
 17                               6
 18                              10
 19                               2
 20                               4
 21                               1
 22                               3
 23                               4
 24                               1
 25                               1
 26                               0
 27                               0
 28                               1
Let $X_1, \ldots, X_n$ be an i.i.d. sample from a Poisson distribution with mean $\lambda$, and
let $T = \sum_{i=1}^{n} X_i$.

a. Show that the distribution of $X_1, \ldots, X_n$ given $T$ is independent of $\lambda$, and
conclude that $T$ is sufficient for $\lambda$.

b. Show that $X_1$ is not sufficient.

c. Use Theorem A of Section 8.8.1 to show that $T$ is sufficient. Identify the
functions $g$ and $h$ of that theorem.


Use the factorization theorem (Theorem A in Section 8.8.1) to conclude that
$T = \sum_{i=1}^{n} X_i$ is a sufficient statistic when the $X_i$ are an i.i.d. sample from a
geometric distribution.


Use the factorization theorem to find a sufficient statistic for the exponential
distribution.


Let $X_1, \ldots, X_n$ be an i.i.d. sample from a distribution with the density function

$$f(x \mid \theta) = \frac{\theta}{(1 + x)^{\theta + 1}}, \qquad 0 < \theta < \infty \text{ and } 0 \le x < \infty$$

Find a sufficient statistic for $\theta$.


Show that $\prod_{i=1}^{n} X_i$ and $\sum_{i=1}^{n} X_i$ are sufficient statistics for the gamma
distribution.


Find a sufficient statistic for the Rayleigh density,

$$f(x \mid \theta) = \frac{x}{\theta^2} e^{-x^2/(2\theta^2)}, \qquad x \ge 0$$


Show that the binomial distribution belongs to the exponential family.


Show that the gamma distribution belongs to the exponential family. 


CHAPTER 9


Testing Hypotheses and Assessing Goodness of Fit


9.1 Introduction


We will introduce some of the basic concepts of this chapter by means of a simple 
artificial example; it is important that you read the example carefully. Suppose that I 
have two coins, coin 0 has probability of heads equal to 0.5 and coin 1 has probability 
of heads equal to 0.7. I choose one of the coins, toss it 10 times and tell you the 
number of heads, but do not tell you whether it was coin 0 or coin 1. On the basis 
of the number of heads, your task is to decide which coin it was. What should your
decision rule be?
Let $X$ denote the number of heads. Figure 9.1 gives $p(x)$ for each of the coins.

x      |   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |  10
coin 0 | .0010 | .0098 | .0439 | .1172 | .2051 | .2461 | .2051 | .1172 | .0439 | .0098 | .0010
coin 1 | .0000 | .0001 | .0014 | .0090 | .0368 | .1029 | .2001 | .2668 | .2335 | .1211 | .0282

FIGURE 9.1


Suppose that you observed two heads. Then $P_0(2)/P_1(2)$ is about 30, which we
will call the likelihood ratio: coin 0 was about 30 times more likely to produce this
result than was coin 1. This result would favor coin 0. On the other hand, if there were
8 heads, the likelihood ratio would be $.0439/.2335 = 0.19$, which would favor coin
1. The likelihood ratio will play a central role in the procedures we develop.
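The two rows of probabilities and their ratio can be reproduced with a few lines of code. The following is only an illustrative sketch in Python; the helper names `binom_pmf` and `likelihood_ratio` are chosen here for convenience and are not part of the text.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


N_TOSSES = 10
P_COIN0, P_COIN1 = 0.5, 0.7  # heads probabilities of coin 0 and coin 1


def likelihood_ratio(x):
    """P_0(x) / P_1(x): values above 1 favor coin 0, values below 1 favor coin 1."""
    return binom_pmf(x, N_TOSSES, P_COIN0) / binom_pmf(x, N_TOSSES, P_COIN1)


# One line per possible number of heads: both probabilities and their ratio.
for x in range(N_TOSSES + 1):
    print(
        f"x={x:2d}  p0={binom_pmf(x, N_TOSSES, P_COIN0):.4f}  "
        f"p1={binom_pmf(x, N_TOSSES, P_COIN1):.4f}  "
        f"ratio={likelihood_ratio(x):.4g}"
    )
```

Because the ratio is computed from unrounded probabilities, it can differ slightly from a ratio formed from four-decimal table entries.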

We specify two hypotheses, $H_0$ and $H_1$, according to whether coin 0 or coin 1 was
tossed. We first develop a Bayesian methodology for assessing the evidence for each of
the hypotheses. This approach requires the specification of prior probabilities $P(H_0)$
and $P(H_1)$ for each of the hypotheses before observing any data. If you believed that
I have no reason to choose coin 0 over coin 1, you would take $P(H_0) = P(H_1) = 1/2$.
After observing the number of heads, your posterior probabilities would be $P(H_0 \mid x)$
and $P(H_1 \mid x)$. The former would be

$$P(H_0 \mid x) = \frac{P(H_0, x)}{P(x)} = \frac{P(x \mid H_0) P(H_0)}{P(x)}$$

The ratio

$$\frac{P(H_0 \mid x)}{P(H_1 \mid x)} = \frac{P(H_0)}{P(H_1)} \cdot \frac{P(x \mid H_0)}{P(x \mid H_1)}$$

is the product of the ratio of prior probabilities and the likelihood ratio. Thus, the
evidence provided by the data is contained in the likelihood ratio, which is multiplied
by the ratio of prior probabilities to produce the ratio of posterior probabilities.
The likelihood ratio is evaluated in Figure 9.2.

x                |   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7    |   8    |   9    |   10
likelihood ratio | 165.4 | 70.88 | 30.38 | 13.02 | 5.579 | 2.391 | 1.025 | 0.4392 | 0.1882 | 0.0807 | 0.0346

FIGURE 9.2


(The numbers in Figure 9.2 do not precisely agree with the ratios of the numbers in
Figure 9.1 because the entries of Figure 9.1 are rounded to four decimal places.) The evidence
$x$ is increasingly favorable to $H_0$ as $x$ decreases, i.e., the likelihood ratio is monotonic
in $x$. If one's prior probabilities were equal, then for zero to six heads, $H_0$ would be
more probable and for seven to ten heads, $H_1$ would be more probable. If the prior
probabilities change, the breakpoint changes. If you were asked to choose $H_0$ or $H_1$
on the basis of the data $x$, it seems reasonable that you would choose the hypothesis
which had larger posterior probability. You would choose $H_0$ if


$$\frac{P(H_0 \mid x)}{P(H_1 \mid x)} = \frac{P(H_0)}{P(H_1)} \cdot \frac{P(x \mid H_0)}{P(x \mid H_1)} > 1$$

or equivalently if

$$\frac{P(x \mid H_0)}{P(x \mid H_1)} > c$$

where the critical value $c$ depends upon your prior probability. Your decision would
be based on the likelihood ratio: you accept $H_0$ if the likelihood ratio is greater than
$c$, and you reject $H_0$ if the likelihood ratio is less than $c$.
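The dependence of the decision on the prior can be made concrete numerically. The sketch below is illustrative only; `posterior_h0` is a name introduced here, and the prior value 10/11 is simply an example of prior odds of 10 to 1 in favor of coin 0.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def posterior_h0(x, prior_h0=0.5, n=10, p0=0.5, p1=0.7):
    """Posterior probability of H0 (coin 0) after observing x heads in n tosses."""
    joint0 = prior_h0 * binom_pmf(x, n, p0)        # P(H0) P(x | H0)
    joint1 = (1 - prior_h0) * binom_pmf(x, n, p1)  # P(H1) P(x | H1)
    return joint0 / (joint0 + joint1)


# Values of x for which H0 has the larger posterior probability.
equal_prior = [x for x in range(11) if posterior_h0(x) > 0.5]
skewed_prior = [x for x in range(11) if posterior_h0(x, prior_h0=10 / 11) > 0.5]
print("equal priors      :", equal_prior)
print("prior odds 10 to 1:", skewed_prior)
```

With equal priors the breakpoint falls between six and seven heads; with prior odds of 10 to 1 it moves to between eight and nine heads.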

Let us now further examine the consequences of a particular decision rule, i.e., a
particular specification of the constant $c$. First suppose that $c = 1$; then $H_0$ is accepted
as long as $X \le 6$ and is rejected in favor of $H_1$ if $X > 6$. We can make two possible
errors: reject $H_0$ when it is true, or accept $H_0$ when it is false. The probabilities of
these two possible errors can be evaluated as follows:

$$P(\text{reject } H_0 \mid H_0) = P(X > 6 \mid H_0) = 0.17$$

Here we used the binomial probabilities above corresponding to $H_0$. Similarly, the
probability of the other kind of error is

$$P(\text{accept } H_0 \mid H_1) = P(X \le 6 \mid H_1) = 0.35$$


Now suppose that $c = 0.1$, which corresponds to $P(H_0)/P(H_1) = 10$. Then
from Figure 9.2, $H_0$ is accepted when $X \le 8$. Compared to equal odds, more extreme
evidence is required to reject $H_0$ because the prior probabilities greatly favor $H_0$.
Then the probabilities of the two types of errors are

$$P(\text{reject } H_0 \mid H_0) = P(X > 8 \mid H_0) = 0.01$$

$$P(\text{accept } H_0 \mid H_1) = P(X \le 8 \mid H_1) = 0.85$$


In this way, we see that there is a correspondence between the prior probabilities 
and the probabilities of the two types of errors. The constant c controls the tradeoff 
between the probabilities of the two types of errors. 
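The tradeoff can be checked by computing both error probabilities for a given $c$. This is a sketch under the assumptions of the coin example (10 tosses, heads probabilities 0.5 and 0.7); the function name `error_probabilities` is introduced here for illustration.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def error_probabilities(c, n=10, p0=0.5, p1=0.7):
    """Return (P(reject H0 | H0), P(accept H0 | H1)) for the rule: reject if LR < c."""
    reject = [
        x for x in range(n + 1)
        if binom_pmf(x, n, p0) / binom_pmf(x, n, p1) < c
    ]
    accept = [x for x in range(n + 1) if x not in reject]
    prob_reject_when_true = sum(binom_pmf(x, n, p0) for x in reject)
    prob_accept_when_false = sum(binom_pmf(x, n, p1) for x in accept)
    return prob_reject_when_true, prob_accept_when_false


for c in (1.0, 0.1):
    err1, err2 = error_probabilities(c)
    print(f"c={c}: P(reject H0 | H0)={err1:.3f}, P(accept H0 | H1)={err2:.3f}")
```

For $c = 1$ this gives approximately 0.172 and 0.350; for $c = 0.1$, approximately 0.011 and 0.851. Lowering $c$ shrinks the first error probability at the cost of the second.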


The Neyman-Pearson Paradigm 


Rather than using a Bayesian approach, Neyman and Pearson formulated their theory
of hypothesis testing by casting it as a decision problem and making the probabilities
of the two types of errors central, thus bypassing the necessity of specifying prior
probabilities. In doing so, this approach introduced an asymmetry: one hypothesis is
singled out as the null hypothesis and the other as the alternative hypothesis, the
former usually denoted by $H_0$ and the latter by $H_1$ or $H_A$. We will see later through
examples how this specification is naturally made, but for now we will continue with
the example of the previous section and arbitrarily declare $H_0$ to be the null hypothesis.
The following terminology is standard:


Rejecting $H_0$ when it is true is called a type I error.

The probability of a type I error is called the significance level of the test and is
usually denoted by $\alpha$.

Accepting the null hypothesis when it is false is called a type II error and its
probability is usually denoted by $\beta$.

The probability that the null hypothesis is rejected when it is false is called the
power of the test, and equals $1 - \beta$.

We have seen in this example how rejecting $H_0$ when the likelihood ratio is less
than a constant $c$ is equivalent to rejecting when the number of heads is greater than
some value $x_0$. The likelihood ratio, or equivalently, the number of heads, is called
the test statistic.

The set of values of the test statistic that leads to rejection of the null hypothesis is 
called the rejection region, and the set of values that leads to acceptance is called 
the acceptance region. 

The probability distribution of the test statistic when the null hypothesis is true is 
called the null distribution. 
In the example in the introduction to this chapter, the null and alternative
hypotheses each completely specify the probability distribution of the number of heads,
as binomial(10, 0.5) or binomial(10, 0.7), respectively. These are called simple
hypotheses. The Neyman-Pearson Lemma shows that basing the test on the
likelihood ratio as we did is optimal:


NEYMAN-PEARSON LEMMA 


Suppose that $H_0$ and $H_A$ are simple hypotheses and that the test that rejects $H_0$
whenever the likelihood ratio is less than $c$ has significance level $\alpha$. Then any
other test for which the significance level is less than or equal to $\alpha$ has power
less than or equal to that of the likelihood ratio test. ■


The point is that there are many possible tests. Any test that rejects when the
observations fall in a set having probability less than or equal to $\alpha$ when
the null hypothesis is true, and accepts when they fall in its complement,
has significance level less than or equal to $\alpha$ by construction.
Among all such possible partitions, that based on the likelihood ratio maximizes the
power.


Proof 


Let $f(x)$ denote the probability density function or frequency function of the
observations. A test of $H_0: f(x) = f_0(x)$ versus $H_A: f(x) = f_A(x)$ amounts to
using a decision function $d(x)$, where $d(x) = 0$ if $H_0$ is accepted and $d(x) = 1$
if $H_0$ is rejected. Since $d(X)$ is a Bernoulli random variable, $E(d(X)) = P(d(X) = 1)$.
The significance level of the test is thus $\alpha = P_0(d(X) = 1) = E_0(d(X))$,
and the power is $P_A(d(X) = 1) = E_A(d(X))$. Here $E_0$ denotes
expectation under the probability law specified by $H_0$, etc.

Let $d(X)$ correspond to the likelihood ratio test: $d(x) = 1$ if $f_0(x) < c f_A(x)$
and $E_0(d(X)) = \alpha$. Let $d^*(x)$ be the decision function of another test satisfying
$E_0(d^*(X)) \le E_0(d(X)) = \alpha$. We will show that $E_A(d^*(X)) \le E_A(d(X))$. This
will follow from the key inequality

$$d^*(x)[c f_A(x) - f_0(x)] \le d(x)[c f_A(x) - f_0(x)]$$

which holds since if $d(x) = 1$, $c f_A(x) - f_0(x) > 0$ and if $d(x) = 0$,
$c f_A(x) - f_0(x) \le 0$. Now integrating (or summing) both sides of the inequality above with
respect to $x$ gives

$$c E_A(d^*(X)) - E_0(d^*(X)) \le c E_A(d(X)) - E_0(d(X))$$

and thus

$$E_0(d(X)) - E_0(d^*(X)) \le c[E_A(d(X)) - E_A(d^*(X))]$$

The conclusion follows since the left-hand side of this inequality is nonnegative
by assumption. ■
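For the coin example the lemma can be verified by brute force, since every nonrandomized test is just a subset of $\{0, 1, \ldots, 10\}$ used as a rejection region. The sketch below enumerates all such subsets; it is an illustration of the statement under the assumptions of that example ($c = 1$, heads probabilities 0.5 and 0.7), not a substitute for the proof, and the names used are introduced here.

```python
from itertools import combinations
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


N = 10
NULL = [binom_pmf(x, N, 0.5) for x in range(N + 1)]  # distribution under H0
ALT = [binom_pmf(x, N, 0.7) for x in range(N + 1)]   # distribution under HA


def level_and_power(region):
    """Significance level and power of the test that rejects when X is in region."""
    level = sum(NULL[x] for x in region)
    power = sum(ALT[x] for x in region)
    return level, power


# Likelihood ratio test with c = 1: reject when NULL[x] / ALT[x] < 1.
lr_region = tuple(x for x in range(N + 1) if NULL[x] < ALT[x])
lr_level, lr_power = level_and_power(lr_region)

# Search all 2**11 rejection regions whose level does not exceed that of the LR test.
best_power = 0.0
for size in range(N + 2):
    for region in combinations(range(N + 1), size):
        level, power = level_and_power(region)
        if level <= lr_level + 1e-12:
            best_power = max(best_power, power)

print("LR rejection region:", lr_region)
print(f"LR test: level={lr_level:.4f}, power={lr_power:.4f}")
print(f"largest power among competing regions: {best_power:.4f}")
```

No competing region with level at most that of the likelihood ratio test achieves greater power, as the lemma asserts.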
Let $X_1, \ldots, X_n$ be a random sample from a normal distribution having known variance
$\sigma^2$. Consider two simple hypotheses:

$$H_0: \mu = \mu_0$$

$$H_A: \mu = \mu_1$$

where $\mu_0$ and $\mu_1$ are given constants. Let the significance level $\alpha$ be prescribed. The
Neyman-Pearson Lemma states that among all tests with significance level $\alpha$, the test
that rejects for small values of the likelihood ratio is most powerful. We thus calculate
the likelihood ratio, which is

$$\frac{f_0(\mathbf{X})}{f_1(\mathbf{X})} = \frac{\exp\left[-\dfrac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu_0)^2\right]}{\exp\left[-\dfrac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu_1)^2\right]}$$

since the multipliers of the exponentials cancel. Small values of this statistic correspond
to small values of $\sum_{i=1}^{n} (X_i - \mu_1)^2 - \sum_{i=1}^{n} (X_i - \mu_0)^2$. Expanding the squares,
we see that the latter expression reduces to

$$2n\bar{X}(\mu_0 - \mu_1) + n\mu_1^2 - n\mu_0^2$$

Now, if $\mu_0 - \mu_1 > 0$, the likelihood ratio is small if $\bar{X}$ is small; if $\mu_0 - \mu_1 < 0$, the
likelihood ratio is small if $\bar{X}$ is large. To be concrete, let us assume the latter case.
We then know that the likelihood ratio is a function of $\bar{X}$ and is small when $\bar{X}$ is
large. The Neyman-Pearson lemma thus tells us that the most powerful test rejects
for $\bar{X} > x_0$ for some $x_0$, and we choose $x_0$ so as to give the test the desired level $\alpha$.
That is, $x_0$ is chosen so that $P(\bar{X} > x_0) = \alpha$ if $H_0$ is true. Under $H_0$ in this example,
the null distribution of $\bar{X}$ is a normal distribution with mean $\mu_0$ and variance $\sigma^2/n$,
so $x_0$ can be chosen from tables of the standard normal distribution. Since

$$P(\bar{X} > x_0) = P\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > \frac{x_0 - \mu_0}{\sigma/\sqrt{n}}\right)$$

we can solve

$$\frac{x_0 - \mu_0}{\sigma/\sqrt{n}} = z(\alpha)$$

for $x_0$ in order to find the rejection region for a level $\alpha$ test. Here, as usual, $z(\alpha)$ denotes
the upper $\alpha$ point of the standard normal distribution; that is, if $Z$ is a standard normal
random variable, $P(Z > z(\alpha)) = \alpha$. ■
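In place of tables, the threshold $x_0 = \mu_0 + z(\alpha)\,\sigma/\sqrt{n}$ can be computed directly. The sketch below uses Python's standard library; the function names and the numerical values ($\mu_0 = 0$, $\mu_1 = 1$, $\sigma = 2$, $n = 25$, $\alpha = 0.05$) are hypothetical choices made here purely for illustration, for the case $\mu_1 > \mu_0$.

```python
from math import sqrt
from statistics import NormalDist


def critical_value(mu0, sigma, n, alpha):
    """x0 such that P(Xbar > x0) = alpha when Xbar ~ N(mu0, sigma^2 / n)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # upper alpha point z(alpha)
    return mu0 + z_alpha * sigma / sqrt(n)


def power(mu0, mu1, sigma, n, alpha):
    """P(Xbar > x0) when the true mean is mu1 (assumes mu1 > mu0)."""
    x0 = critical_value(mu0, sigma, n, alpha)
    return 1 - NormalDist(mu1, sigma / sqrt(n)).cdf(x0)


# Hypothetical numbers, not taken from the text.
x0 = critical_value(mu0=0.0, sigma=2.0, n=25, alpha=0.05)
print(f"reject H0 when the sample mean exceeds {x0:.4f}")
print(f"power against mu1 = 1: {power(0.0, 1.0, 2.0, 25, 0.05):.4f}")
```

The same threshold serves for every alternative $\mu_1 > \mu_0$, since $x_0$ depends only on the null distribution; only the power changes with $\mu_1$.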


This example is typical of the way that the Neyman-Pearson Lemma is used. We
write down the likelihood ratio and observe that small values of it correspond in a
one-to-one manner with extreme values of a test statistic, in this case $\bar{X}$. Knowing
the null distribution of the test statistic makes it possible to choose a critical level that
produces a desired significance level $\alpha$.
Unfortunately, the Neyman-Pearson Lemma is of little direct utility in most
practical problems, because the case of testing a simple null hypothesis versus a
simple alternative is rarely encountered. If a hypothesis does not completely specify
the probability distribution, the hypothesis is called a composite hypothesis. Here
are some examples:


EXAMPLE B

Goodness-of-Fit Test

Let $X_1, X_2, \ldots, X_n$ be a sample from a discrete probability distribution. The null
hypothesis could be that the distribution is Poisson with some unspecified mean, and
the alternative could be that the distribution is not Poisson. For example, we might
want to test whether a Poisson model is reasonable for the data of Example A in
Section 8.4. Since the null hypothesis does not completely specify the distribution
of the $X_i$'s, it is composite. If the null hypothesis were refined to state that the
distribution was Poisson with some specified mean, then it would be simple. The
alternative hypothesis does not completely specify the distribution, so it is composite.
We will take up the subject of testing for goodness of fit later in this chapter. ■


EXAMPLE C

Testing for ESP

Consider a hypothetical experiment in which a subject is asked to identify, without
looking, the suits of 20 cards drawn randomly with replacement from a 52 card deck.
Let $T$ be the number of correct identifications. The null hypothesis states that the
person is purely guessing, and the alternative states that the person has extrasensory
ability. The null hypothesis is simple because then $T$ is binomial(20, 0.25). The
alternative does not completely specify the distribution of $T$, so it is composite; note
that it does not even specify that the distribution is binomial. ■


This example is useful for further illustrating two other issues that arise in
hypothesis testing: the specification of the significance level and the choice of the null
hypothesis.


9.2.1 Specification of the Significance Level
and the Concept of a p-value


One of the strengths of the Neyman-Pearson approach is that only the distribution 
under the null hypothesis is needed in order to construct a test. In Example C above, 
it would be conventional and convenient to take the null hypothesis to be that of pure 
guessing; we discuss this further in the next section. In this case, the null distribution 
of T is binomial(20,0.25). Because large values of T tend to lend credence to the 
alternative, the rejection region would be of the form {T > tọ} where tọ is chosen 
so that P(T > to| Ho) = a, the desired significance level of the test. For example, 
from calculating binomial probabilities, we find P(T > 12) = .0009, so for this 
choice of the critical region, the null hypothesis of no ESP ability would be falsely 
rejected only with probability about one in a thousand. Note that we did not need to 
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specify the form of the probability distribution under the alternative, but used only 
the notion that if the alternative is true, the subject would be expected to correctly 
identify more suits than if purely guessing. In comparison, a fully Bayesian treat- 
ment would have to specify the distribution under the alternative as well as prior 
probabilities. 

The theory requires the specification of the significance level, $\alpha$, in advance
of analyzing the data, but gives no guidance about how to make this choice. In
practice it is almost always the case that the choice of $\alpha$ is essentially arbitrary, but
is heavily influenced by custom. Small values, such as 0.01 and 0.05, are commonly
used. Another criticism of the paradigm is that it is built on the assumption that one
must either reject or not reject a hypothesis, when typically no such decision is actually
required. The theory is thus often applied in a hypothetical manner. For example,
suppose that the subject above guessed the suit correctly nine times. Since
$P(T \ge 9 \mid H_0) = .041$, the null hypothesis would have been “rejected” at the significance
level $\alpha = .05$, if one were actually “rejecting” or “not rejecting.” Thus, the evidence
is often summarized as a p-value, which is defined to be the smallest significance
level at which the null hypothesis would be rejected. If nine suits were identified
correctly, the p-value would be 0.041. If ten suits were identified, it would be 0.014,
since $P(T \ge 10 \mid H_0) = .014$, etc.
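
The tail probabilities quoted for the ESP example can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `binom_upper_tail` is illustrative and not part of the text.

```python
from math import comb

def binom_upper_tail(t, n=20, p=0.25):
    """P(T >= t) for T ~ binomial(n, p), the null distribution of pure guessing."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

# Critical region {T >= 12}: chance of falsely rejecting pure guessing.
alpha_12 = binom_upper_tail(12)

# p-values when nine or ten suits are identified correctly.
p_nine = binom_upper_tail(9)
p_ten = binom_upper_tail(10)

print(f"P(T >= 12 | H0) = {alpha_12:.4f}")
print(f"p-value for T = 9:  {p_nine:.3f}")
print(f"p-value for T = 10: {p_ten:.3f}")
```

Because T is discrete, only certain significance levels are exactly attainable; the p-value is simply the upper-tail probability at the observed count.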

The use of a p-value to summarize the evidence against the null hypothesis was 
advocated by the eminent statistician Sir Ronald Fisher. But rather than casting it 
within a hypothetical framework of “rejection,” he thought of the p-value as being 
the probability under the null hypothesis of a result as or more extreme than that 
actually observed. So for example, in the case that ten suits are identified, the p-value 
is the chance of someone getting at least that many correct by purely guessing. The 
smaller the p-value, the stronger the evidence against the null hypothesis. 

The Bayesian paradigm summarizes the evidence for and against the null hy- 
pothesis as a posterior probability. Its application depends on specifying probability 
models under both the null and the alternative and on assigning meaningful prior prob- 
abilities. It is important to understand that a p-value is not the posterior probability 
that the null hypothesis is true. To reiterate, the p-value is the probability of a result 
as or more extreme than that actually observed if the null hypothesis were true. This 
is a probability, but it is not the posterior probability that the null hypothesis is true; 
the latter depends on the specification of prior probabilities. Consider the example of 
Section 9.1. If x = 8 heads are observed, the p-value is .0439 + .0098 + .0010 =
.0546, or about 5%. Suppose that the prior probabilities were equal. The likelihood
ratio is $0.1882 = P(H_0 \mid x)/(1 - P(H_0 \mid x))$, from which it follows that
$P(H_0 \mid x) = 0.1584$, or about 16%.
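
The contrast between the p-value and the posterior probability can be reproduced numerically. The sketch below assumes that the Section 9.1 example involves 10 tosses of a coin whose probability of heads is 0.5 under the null and 0.7 under the alternative; these values are not stated in this section and are inferred from the quoted numbers, so treat them as assumptions.

```python
from math import comb

n, x = 10, 8
p0, p1 = 0.5, 0.7   # assumed heads probabilities under H0 and H1

def pmf(k, n, p):
    """Binomial frequency function."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# p-value: probability under H0 of a result as or more extreme than x = 8.
p_value = sum(pmf(k, n, p0) for k in range(x, n + 1))

# With equal prior probabilities, the posterior odds of H0 equal the likelihood ratio.
likelihood_ratio = pmf(x, n, p0) / pmf(x, n, p1)
posterior_h0 = likelihood_ratio / (1 + likelihood_ratio)

print(f"p-value          = {p_value:.4f}")
print(f"likelihood ratio = {likelihood_ratio:.4f}")
print(f"P(H0 | x)        = {posterior_h0:.4f}")
```

Changing the prior odds changes `posterior_h0` but leaves `p_value` untouched, which is the point of the distinction.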


The Null Hypothesis 


As should be clear by now, there is an asymmetry in the Neyman-Pearson paradigm 
between the null and alternative hypotheses. The decision as to which is the null 
and which is the alternative hypothesis is not a mathematical one, and depends on 
scientific context, custom, and convenience. This will gradually become clearer as 
we see more real examples in this and later chapters, and for now we will make only 
the following remarks:

* In Example B of Section 9.2, we chose as the null hypothesis the hypothesis that
the distribution was Poisson and as the alternative hypothesis the hypothesis that 
the distribution was not Poisson. In this case, the null hypothesis is simpler than the 
alternative, which in a sense contains more distributions than does the null. It is 
conventional to choose the simpler of two hypotheses as the null. 

* The consequences of incorrectly rejecting one hypothesis may be graver than those 
of incorrectly rejecting the other. In such a case, the former should be chosen as the 
null hypothesis, because the probability of falsely rejecting it could be controlled 
by choosing $\alpha$. Examples of this kind arise in screening new drugs; frequently, it
must be documented rather conclusively that a new drug has a positive effect before 
it is accepted for general use. 

* In scientific investigations, the null hypothesis is often a simple explanation that
must be discredited in order to demonstrate the presence of some physical phenomenon
or effect. The hypothetical ESP experiment referred to earlier falls in this
category; the null hypothesis states that the subject is merely guessing, that there 
is no ESP. The validity of the null hypothesis would not be cast in doubt unless the 
results would be extremely unlikely under the null. We will see other examples of 
this type beginning in Chapter 11. 


Uniformly Most Powerful Tests 


The optimality result of the Neyman-Pearson Lemma requires that both hypotheses be 
simple. In some cases, the theory can be extended to include composite hypotheses. If 
the alternative $H_1$ is composite, a test that is most powerful for every simple alternative
in $H_1$ is said to be uniformly most powerful.


Continuing with Example A of Section 9.2, consider testing $H_0: \mu = \mu_0$ versus
$H_1: \mu > \mu_0$. In Example A, we saw that for a particular alternative $\mu_1 > \mu_0$, the most
powerful test rejects for $\bar{X} > x_0$, where $x_0$ depends on $\mu_0$, $\sigma$, and $n$, but not on $\mu_1$. Because
this test is most powerful and is the same for every alternative, it is uniformly most
powerful.


It can also be argued that the test is uniformly most powerful for testing $H_0: \mu \le \mu_0$
versus $H_1: \mu > \mu_0$. But it is not uniformly most powerful for testing $H_0: \mu = \mu_0$
versus $H_1: \mu \ne \mu_0$. This follows from further examination of the example, which
shows that the test that is most powerful against the alternative that $\mu > \mu_0$ rejects for
large values of $\bar{X} - \mu_0$, whereas the test that is most powerful against the alternative
$\mu < \mu_0$ rejects for small values of $\bar{X} - \mu_0$. The most powerful test is thus not the
same for every alternative.
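
The failure of any single test to be best against every two-sided alternative can be seen by comparing power functions. The sketch below computes the power of the one-sided and two-sided level-$\alpha$ tests for a normal mean with known $\sigma$; the values of `mu0`, `sigma`, `n`, and `alpha` are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_one_sided(mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Power of the test rejecting when X-bar - mu0 > sigma * z(alpha) / sqrt(n)."""
    shift = (mu - mu0) * n**0.5 / sigma
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - shift)

def power_two_sided(mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Power of the test rejecting when |X-bar - mu0| > sigma * z(alpha/2) / sqrt(n)."""
    shift = (mu - mu0) * n**0.5 / sigma
    z = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(z - shift)) + Z.cdf(-z - shift)

for mu in (-0.4, -0.2, 0.0, 0.2, 0.4):
    print(f"mu = {mu:+.1f}  one-sided: {power_one_sided(mu):.3f}  "
          f"two-sided: {power_two_sided(mu):.3f}")
```

For $\mu > \mu_0$ the one-sided test has the higher power, while for $\mu < \mu_0$ its power falls below $\alpha$ and the two-sided test does better, so neither dominates over the whole two-sided alternative.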

In typical composite situations, there is no uniformly most powerful test. The
alternatives $H_1: \mu < \mu_0$ and $H_1: \mu > \mu_0$ are called one-sided alternatives. The
alternative $H_1: \mu \ne \mu_0$ is a two-sided alternative.


The Duality of Confidence Intervals
and Hypothesis Tests


There is a duality between confidence intervals (more generally, confidence sets) and
hypothesis tests. In this section, we will show that a confidence set can be obtained by
“inverting” a hypothesis test, and vice versa. Before presenting the general structure,
we consider an example.


Let $X_1, \ldots, X_n$ be a random sample from a normal distribution having unknown
mean $\mu$ and known variance $\sigma^2$. We consider testing the following hypotheses:

$$H_0: \mu = \mu_0$$
$$H_A: \mu \ne \mu_0$$

Consider a test at a specific level $\alpha$ that rejects for $|\bar{X} - \mu_0| > x_0$, where $x_0$ is
determined so that $P(|\bar{X} - \mu_0| > x_0) = \alpha$ if $H_0$ is true: $x_0 = \sigma_{\bar{X}} z(\alpha/2)$. Here the
standard deviation of $\bar{X}$ is denoted by $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. The test thus accepts when

$$|\bar{X} - \mu_0| < \sigma_{\bar{X}} z(\alpha/2)$$

or

$$-\sigma_{\bar{X}} z(\alpha/2) < \bar{X} - \mu_0 < \sigma_{\bar{X}} z(\alpha/2)$$

or

$$\bar{X} - \sigma_{\bar{X}} z(\alpha/2) < \mu_0 < \bar{X} + \sigma_{\bar{X}} z(\alpha/2)$$

A $100(1 - \alpha)\%$ confidence interval for $\mu$ is

$$[\bar{X} - \sigma_{\bar{X}} z(\alpha/2),\ \bar{X} + \sigma_{\bar{X}} z(\alpha/2)]$$

Comparing the acceptance region of the test to the confidence interval, we see that
$\mu_0$ lies in the confidence interval for $\mu$ if and only if the hypothesis test accepts. In
other words, the confidence interval consists precisely of all those values of $\mu_0$ for
which the null hypothesis $H_0: \mu = \mu_0$ is accepted.
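
The equivalence in this example can be checked directly by scanning candidate values of $\mu_0$. This is a minimal sketch; the summary values `xbar`, `sigma`, `n`, and `alpha` are hypothetical, and the helper names are illustrative.

```python
from statistics import NormalDist

def z_upper(a):
    """z(a): the point with upper-tail probability a under N(0, 1)."""
    return NormalDist().inv_cdf(1 - a)

def accepts_h0(xbar, mu0, sigma, n, alpha):
    """Two-sided level-alpha test of H0: mu = mu0 with known sigma."""
    return abs(xbar - mu0) < sigma / n**0.5 * z_upper(alpha / 2)

def confidence_interval(xbar, sigma, n, alpha):
    """Endpoints of the 100(1 - alpha)% interval for mu."""
    half = sigma / n**0.5 * z_upper(alpha / 2)
    return xbar - half, xbar + half

# Hypothetical summary values, chosen only for illustration.
xbar, sigma, n, alpha = 10.3, 2.0, 16, 0.05
lo, hi = confidence_interval(xbar, sigma, n, alpha)

# For every candidate mu0 on a grid, the test accepts exactly when mu0 is in the interval.
grid = [8 + 0.01 * i for i in range(500)]
agree = all(accepts_h0(xbar, m, sigma, n, alpha) == (lo < m < hi) for m in grid)

print(f"interval: ({lo:.3f}, {hi:.3f}); test and interval agree on grid: {agree}")
```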


We now demonstrate that this duality holds more generally. Let $\theta$ be a parameter
of a family of probability distributions, and denote the set of all possible values of $\theta$
by $\Theta$. Denote the random variables constituting the data by $\mathbf{X}$.


THEOREM A


Suppose that for every value $\theta_0$ in $\Theta$ there is a test at level $\alpha$ of the hypothesis
$H_0: \theta = \theta_0$. Denote the acceptance region of the test by $A(\theta_0)$. Then the set

$$C(\mathbf{X}) = \{\theta : \mathbf{X} \in A(\theta)\}$$

is a $100(1 - \alpha)\%$ confidence region for $\theta$.


Proof

Because $A$ is the acceptance region of a test at level $\alpha$,

$$P[\mathbf{X} \in A(\theta_0) \mid \theta = \theta_0] = 1 - \alpha$$

Now,

$$P[\theta_0 \in C(\mathbf{X}) \mid \theta = \theta_0] = P[\mathbf{X} \in A(\theta_0) \mid \theta = \theta_0] = 1 - \alpha$$

by the definition of $C(\mathbf{X})$.


It is helpful to state Theorem A in words: A $100(1 - \alpha)\%$ confidence region for
$\theta$ consists of all those values of $\theta_0$ for which the hypothesis that $\theta$ equals $\theta_0$ will not
be rejected at level $\alpha$.


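THEOREM B


Suppose that $C(\mathbf{X})$ is a $100(1 - \alpha)\%$ confidence region for $\theta$; that is, for every
$\theta_0$,

$$P[\theta_0 \in C(\mathbf{X}) \mid \theta = \theta_0] = 1 - \alpha$$

Then an acceptance region for a test at level $\alpha$ of the hypothesis $H_0: \theta = \theta_0$ is

$$A(\theta_0) = \{\mathbf{X} \mid \theta_0 \in C(\mathbf{X})\}$$


Proof

The test has level $\alpha$ because

$$P(\mathbf{X} \in A(\theta_0) \mid \theta = \theta_0) = P(\theta_0 \in C(\mathbf{X}) \mid \theta = \theta_0) = 1 - \alpha$$


In words, Theorem B says that the hypothesis that $\theta$ equals $\theta_0$ is accepted if $\theta_0$
lies in the confidence region.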

This duality can be quite useful. In some situations, it is possible to form confi- 
dence intervals for parameters of probability distributions and then use those intervals 
to test hypotheses about the values of those parameters. In other situations, it may
be relatively easy to test hypotheses and then determine the acceptance regions for
the test to form confidence intervals that might have been quite difficult to derive 
in a more direct manner. We will see examples of both types of situations in later 
chapters. 


Generalized Likelihood Ratio Tests 


The likelihood ratio test is optimal for testing a simple hypothesis versus a simple 
hypothesis. In this section, we will develop a generalization of this test for use in 
situations in which the hypotheses are not simple. Such tests are not generally optimal, 
but they are typically nonoptimal in situations for which no optimal test exists, and 
they usually perform reasonably well. Generalized likelihood ratio tests have wide 
utility; they play the same role in testing as mle's do in estimation. 

It is frequently the case that the hypotheses under consideration specify, or par- 
tially specify, the values of parameters of the probability distribution that has generated 
the data. Specifically, suppose that the observations $\mathbf{X} = (X_1, \ldots, X_n)$ have a joint
density or frequency function $f(\mathbf{x} \mid \theta)$. Then $H_0$ may specify that $\theta \in \omega_0$, where $\omega_0$ is
a subset of the set of all possible values of $\theta$, and $H_1$ may specify that $\theta \in \omega_1$, where
$\omega_1$ is disjoint from $\omega_0$. Let $\Omega = \omega_0 \cup \omega_1$. Based on the data, a plausible measure of the
relative tenability of the hypotheses is the ratio of their likelihoods. If the hypothe- 
ses are composite, each likelihood is evaluated at that value of 0 that maximizes it, 
yielding the generalized likelihood ratio 


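$$\Lambda^* = \frac{\max_{\theta \in \omega_0} [\mathrm{lik}(\theta)]}{\max_{\theta \in \omega_1} [\mathrm{lik}(\theta)]}$$

Small values of $\Lambda^*$ tend to discredit $H_0$.
It is preferable for certain technical reasons to use the test statistic

$$\Lambda = \frac{\max_{\theta \in \omega_0} [\mathrm{lik}(\theta)]}{\max_{\theta \in \Omega} [\mathrm{lik}(\theta)]}$$

rather than $\Lambda^*$. Note that $\Lambda = \min(\Lambda^*, 1)$, so small values of $\Lambda^*$ correspond to small
values of $\Lambda$. The rejection region for a likelihood ratio test consists of small values of
$\Lambda$, for example, all $\Lambda \le \lambda_0$. The threshold $\lambda_0$ is chosen so that $P(\Lambda \le \lambda_0 \mid H_0) = \alpha$,
the desired significance level of the test.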

We now illustrate the construction of a likelihood ratio test with a simple example. 


Testing a Normal Mean 

Let $X_1, \ldots, X_n$ be i.i.d. and normally distributed with mean $\mu$ and variance $\sigma^2$,
where $\sigma$ is known. We wish to test $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$, where $\mu_0$ is a
prescribed number. The role of $\theta$ is played by $\mu$, and $\omega_0 = \{\mu_0\}$, $\omega_1 = \{\mu \mid \mu \ne \mu_0\}$,
and $\Omega = \{-\infty < \mu < \infty\}$.
Since $\omega_0$ consists of only one point, the numerator of the likelihood ratio statistic is

$$\frac{1}{(\sigma\sqrt{2\pi})^n} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu_0)^2\right]$$

For the denominator, we have to maximize the likelihood for $\mu \in \Omega$, which is achieved
when $\mu$ is the mle $\bar{X}$. The denominator is the likelihood of $X_1, X_2, \ldots, X_n$ evaluated
with $\mu = \bar{X}$:

$$\frac{1}{(\sigma\sqrt{2\pi})^n} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X})^2\right]$$

The likelihood ratio statistic is, therefore,

$$\Lambda = \exp\left\{-\frac{1}{2\sigma^2} \left[\sum_{i=1}^{n} (X_i - \mu_0)^2 - \sum_{i=1}^{n} (X_i - \bar{X})^2\right]\right\}$$

Rejecting for small values of $\Lambda$ is equivalent to rejecting for large values of

$$-2\log\Lambda = \frac{1}{\sigma^2} \left[\sum_{i=1}^{n} (X_i - \mu_0)^2 - \sum_{i=1}^{n} (X_i - \bar{X})^2\right]$$

Using the identity

$$\sum_{i=1}^{n} (X_i - \mu_0)^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2$$

we see that the likelihood ratio test rejects for large values of
$-2\log\Lambda = n(\bar{X} - \mu_0)^2/\sigma^2$. The distribution of this statistic under $H_0$ is chi-square with 1 degree
of freedom. This follows, since under $H_0$, $\bar{X} \sim N(\mu_0, \sigma^2/n)$, which implies that
$\sqrt{n}(\bar{X} - \mu_0)/\sigma \sim N(0, 1)$ and hence its square, $-2\log\Lambda \sim \chi_1^2$. Knowing the null
distribution of the test statistic makes possible the construction of a rejection region
for any significance level $\alpha$: The test rejects when

$$\frac{n}{\sigma^2} (\bar{X} - \mu_0)^2 > \chi_1^2(\alpha)$$

Again using the fact that a chi-square random variable with 1 degree of freedom is
the square of a standard normal random variable, we can rewrite this relation to show
that the rejection region for the test is

$$|\bar{X} - \mu_0| > \frac{\sigma}{\sqrt{n}} z(\alpha/2)$$


The preceding derivation has been rather formal, but upon examination the result
looks perfectly reasonable or perhaps even so obvious as to make us doubt the value of
the formal exercise: The test of $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$ rejects when $|\bar{X} - \mu_0|$
is large. The test does not reject when $-\sigma z(\alpha/2)/\sqrt{n} < \bar{X} - \mu_0 < \sigma z(\alpha/2)/\sqrt{n}$ or,
equivalently, when $\bar{X} - \sigma z(\alpha/2)/\sqrt{n} < \mu_0 < \bar{X} + \sigma z(\alpha/2)/\sqrt{n}$. That is, the test
does not reject when $\mu_0$ lies in a $100(1 - \alpha)\%$ confidence interval for $\mu$. Compare to
Example A of Section 9.3.

In order for the likelihood ratio test to have the significance level $\alpha$, $\lambda_0$ must be
chosen so that $P(\Lambda \le \lambda_0) = \alpha$ if $H_0$ is true. If the sampling distribution of $\Lambda$ under
$H_0$ is known, we can determine $\lambda_0$. Generally, the sampling distribution is not of a
simple form, but in many situations the following theorem provides the basis for an 
approximation to the null distribution. 


THEOREM A 


Under smoothness conditions on the probability density or frequency functions
involved, the null distribution of $-2\log\Lambda$ tends to a chi-square distribution
with degrees of freedom equal to $\dim \Omega - \dim \omega_0$ as the sample size tends to
infinity.


The proof, which is beyond the scope of this book, is based on a second-order 
Taylor series expansion. 

In the statement of Theorem A, $\dim \Omega$ and $\dim \omega_0$ are the numbers of free
parameters under $\Omega$ and $\omega_0$, respectively. In Example A, the null hypothesis completely
specifies $\mu$ and $\sigma^2$; there are no free parameters under $\omega_0$, so $\dim \omega_0 = 0$. Under
$\Omega$, $\sigma$ is fixed but $\mu$ is free, so $\dim \Omega = 1$. For this example, the null distribution of
$-2\log\Lambda$ is exactly $\chi_1^2$.
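
For the normal-mean example, the reduction of $-2\log\Lambda$ to $n(\bar{X} - \mu_0)^2/\sigma^2$ can be confirmed numerically by evaluating the two maximized log likelihoods directly. The sketch below uses a small hypothetical data set (the values in `xs`, `mu0`, and `sigma` are made up for illustration) and obtains the p-value from the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal.

```python
from math import erfc, log, pi, sqrt

def normal_log_lik(xs, mu, sigma):
    """Log likelihood of i.i.d. N(mu, sigma^2) observations."""
    n = len(xs)
    return -n * log(sigma * sqrt(2 * pi)) - sum((x - mu) ** 2 for x in xs) / (2 * sigma**2)

def neg2_log_lambda(xs, mu0, sigma):
    """-2 log Lambda from the likelihood at mu0 and at the mle, the sample mean."""
    xbar = sum(xs) / len(xs)
    return -2 * (normal_log_lik(xs, mu0, sigma) - normal_log_lik(xs, xbar, sigma))

def chi2_1_upper_tail(t):
    """P(chi-square with 1 df >= t) = P(|Z| >= sqrt(t))."""
    return erfc(sqrt(t / 2))

# Hypothetical data, for illustration only.
xs = [4.1, 5.3, 6.0, 4.7, 5.9, 5.2, 6.4, 4.8]
mu0, sigma = 5.0, 1.0
n = len(xs)
xbar = sum(xs) / n

direct = neg2_log_lambda(xs, mu0, sigma)
closed_form = n * (xbar - mu0) ** 2 / sigma**2
p_value = chi2_1_upper_tail(direct)

print(f"direct = {direct:.6f}, closed form = {closed_form:.6f}, p-value = {p_value:.3f}")
```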


Likelihood Ratio Tests for the 
Multinomial Distribution 


In this section we will develop a generalized likelihood ratio test of the goodness of fit 
of a model for multinomial cell probabilities. Under the model, the vector of cell
probabilities $\mathbf{p}$ is described by a hypothesis $H_0$, which specifies that $\mathbf{p} = \mathbf{p}(\theta)$, $\theta \in \omega_0$,
where $\theta$ is a parameter that may be unknown. For example, in Section 8.2 we
considered fitting Poisson probabilities that depended on an unknown parameter (there
called $\lambda$, which played the role of $\theta$) to the cell counts in a table. We want to
judge the plausibility of the model relative to a model $H_1$ in which the cell
probabilities are free except for the constraints that they are nonnegative and sum to 1.
If there are $m$ cells, $\Omega$ is thus the set consisting of $m$ nonnegative numbers that
sum to 1.
The numerator of the likelihood ratio is

$$\max_{\mathbf{p} \in \omega_0} \left(\frac{n!}{x_1! \cdots x_m!}\right) p_1(\theta)^{x_1} \cdots p_m(\theta)^{x_m}$$

where the $x_i$ are the observed counts in the $m$ cells. By the definition of the maximum
likelihood estimate, this likelihood is maximized when $\hat\theta$ is the maximum likelihood
estimate of $\theta$. The corresponding probabilities will be denoted by $p_i(\hat\theta)$.
Since the probabilities are unrestricted under $\Omega$, the denominator is maximized
by the unrestricted mle's, or

$$\hat{p}_i = \frac{x_i}{n}$$

The likelihood ratio is, therefore,

$$\Lambda = \frac{\dfrac{n!}{x_1! \cdots x_m!}\, p_1(\hat\theta)^{x_1} \cdots p_m(\hat\theta)^{x_m}}{\dfrac{n!}{x_1! \cdots x_m!}\, \hat{p}_1^{x_1} \cdots \hat{p}_m^{x_m}} = \prod_{i=1}^{m} \left(\frac{p_i(\hat\theta)}{\hat{p}_i}\right)^{x_i}$$

Also, since $x_i = n\hat{p}_i$,

$$-2\log\Lambda = -2n \sum_{i=1}^{m} \hat{p}_i \log\left(\frac{p_i(\hat\theta)}{\hat{p}_i}\right) = 2 \sum_{i=1}^{m} O_i \log\left(\frac{O_i}{E_i}\right)$$

where $O_i = n\hat{p}_i$ and $E_i = n p_i(\hat\theta)$ denote the observed and expected counts,
respectively.

Under $\Omega$, the cell probabilities are allowed to be free, with the constraint that
they sum to 1, so $\dim \Omega = m - 1$. If, under $H_0$, the probabilities $p_i(\theta)$ depend on a
$k$-dimensional parameter $\theta$ that has been estimated from the data, $\dim \omega_0 = k$. The
large sample distribution of $-2\log\Lambda$ is thus a chi-square distribution with $m - k - 1$
degrees of freedom (the number of cells minus the number of estimated parameters
minus 1).

Pearson's chi-square statistic is commonly used to test for goodness of fit:

$$X^2 = \sum_{i=1}^{m} \frac{[x_i - n p_i(\hat\theta)]^2}{n p_i(\hat\theta)}$$

Pearson's statistic and the likelihood ratio are asymptotically equivalent under $H_0$.
To indicate heuristically why this is so, we will go through a Taylor series argument.
To begin,

$$-2\log\Lambda = 2n \sum_{i=1}^{m} \hat{p}_i \log\left(\frac{\hat{p}_i}{p_i(\hat\theta)}\right)$$

If $H_0$ is true and $n$ is large, $\hat{p}_i \approx p_i(\hat\theta)$. The Taylor series expansion of the function

$$f(x) = x \log\left(\frac{x}{x_0}\right)$$

about $x_0$ is

$$f(x) = (x - x_0) + \frac{1}{2} (x - x_0)^2 \frac{1}{x_0} + \cdots$$

Thus,

$$-2\log\Lambda \approx 2n \sum_{i=1}^{m} [\hat{p}_i - p_i(\hat\theta)] + n \sum_{i=1}^{m} \frac{[\hat{p}_i - p_i(\hat\theta)]^2}{p_i(\hat\theta)}$$

The first term on the right-hand side is equal to 0 since the probabilities sum to 1, and
the second term on the right-hand side may be expressed as

$$\sum_{i=1}^{m} \frac{[x_i - n p_i(\hat\theta)]^2}{n p_i(\hat\theta)}$$

since $x_i$, the observed count, equals $n\hat{p}_i$ for $i = 1, \ldots, m$.

We have argued for the approximate equivalence of two test statistics. Pearson's 
test has been more commonly used than the likelihood ratio test, because it is some- 
what easier to calculate without the use of a computer. 

Let us consider some examples. 


Hardy-Weinberg Equilibrium 

Hardy-Weinberg equilibrium was first introduced in Example A in Section 8.5.1. 
We will now test whether this model fits the observed data. Recall that the
Hardy-Weinberg equilibrium model says that the cell probabilities are $(1 - \theta)^2$, $2\theta(1 - \theta)$,
and $\theta^2$. Using the maximum likelihood estimate for $\theta$, $\hat\theta = .4247$, and multiplying the
resulting probabilities by the sample size $n = 1029$, we calculate expected counts,
which are compared with observed counts in the following table:

            |         Blood Type
            |    M        MN        N
Observed    |  342       500       187
Expected    |  340.6     502.8     185.6


The null hypothesis will be that the multinomial distribution is as specified by the 
Hardy-Weinberg equilibrium frequencies, with unknown parameter 0. The alternative 
hypothesis will be that the multinomial distribution does not have probabilities of that 
specified form. We first choose a value for $\alpha$, the significance level for the test (recall
that the significance level is the probability of falsely rejecting the hypothesis that the
multinomial cell probabilities are as specified by genetic theory). In this application,
there is no compelling reason to choose any particular value of $\alpha$, so we will follow
convention and let $\alpha = .05$. This means that our decision rule will falsely reject $H_0$
in only 5% of the cases. 

We will use Pearson's chi-square test, and therefore $X^2$ as our test statistic. The
null distribution of $X^2$ is approximately chi-square with 1 degree of freedom. (There
are two independent cells, and one parameter has been estimated from the data.)
Since, from Table 3 in Appendix B, the point defining the upper 5% of the chi-square
distribution with 1 degree of freedom is 3.84, the test rejects if $X^2 > 3.84$. We next
calculate $X^2$:

$$X^2 = \sum \frac{(O - E)^2}{E} = .0319$$

Thus, the null hypothesis is not rejected.

There is a certain unnecessary rigidity in this procedure, because it is not clear 
that such a decision (to reject or not) has to be made at all. There is also a certain 
arbitrariness: There was no strong reason to let $\alpha = .05$, but that choice essentially
determined our decision. If we had let $\alpha = .01$, the decision would have been the
same since $\chi_1^2(.01) > \chi_1^2(.05)$, but what if we had let $\alpha = .10$, or .20? It is here that
the concept of the p-value becomes useful. Recall that the p-value is the smallest
significance level at which the null hypothesis would be rejected. From a computer
calculation of the chi-square distribution (or from a table of the normal distribution,
since a chi-square distribution with 1 degree of freedom is the square of a standard
normal random variable), the probability that a chi-square random variable with 1
degree of freedom is greater than or equal to .0319 is .86, which is the p-value.
Another interpretation of this p-value is that if the model were correct, deviations 
this large or larger would occur 86% of the time. Thus, the data give us no reason to 
doubt the model. 

In comparison, the likelihood ratio test statistic is

$$-2\log\Lambda = 2 \sum_{i=1}^{3} O_i \log\left(\frac{O_i}{E_i}\right) = .0319$$

The two tests lead to the same conclusion.

Finally, we note that the actual maximized likelihood ratio is $\Lambda = \exp(-.0319/2) = .98$.
Thus the Hardy-Weinberg model is almost as likely as the most general possible
model.
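
The Hardy-Weinberg calculations can be reproduced from the table. The sketch below uses the rounded expected counts exactly as tabulated, so that the results match the quoted values; the function names are illustrative. The p-value uses the relation between the chi-square distribution with 1 degree of freedom and the standard normal.

```python
from math import erfc, exp, log, sqrt

observed = [342, 500, 187]          # blood types M, MN, N
expected = [340.6, 502.8, 185.6]    # rounded expected counts from the table

def pearson_x2(obs, exp_counts):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp_counts))

def neg2_log_lambda(obs, exp_counts):
    """Likelihood ratio statistic 2 * sum O log(O / E); empty cells contribute 0."""
    return 2 * sum(o * log(o / e) for o, e in zip(obs, exp_counts) if o > 0)

x2 = pearson_x2(observed, expected)
g2 = neg2_log_lambda(observed, expected)

# Upper tail of chi-square with 1 df: P(Z^2 >= x2) = P(|Z| >= sqrt(x2)).
p_value = erfc(sqrt(x2 / 2))
lam = exp(-g2 / 2)   # the maximized likelihood ratio itself

print(f"X^2 = {x2:.4f}, -2 log Lambda = {g2:.4f}, p-value = {p_value:.2f}, Lambda = {lam:.2f}")
```

Recomputing the expected counts from the unrounded estimate of $\theta$ would change the statistics slightly in the third decimal place, without affecting the conclusion.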


Bacterial Clumps 

In testing milk for bacterial contamination, 0.01 ml of milk is spread over an area
of 1 cm² on a slide. The slide is mounted on a microscope, and counts of bacterial
clumps within grid squares are made. The Poisson model appears quite reasonable 
for the distribution of the clumps at first glance: The clumps are presumably mixed 
uniformly throughout the milk, and there is no reason to suspect that the clumps 
bunch together. However, on closer examination, two possible problems are noted. 
First, bacteria held by surface tension on the lower surface of the drop may adhere to 
the glass slide on contact, producing increased concentrations in that area of the film. 
Second, the film is not of uniform thickness, being thicker in the center and thinner at 
the edges, giving rise to nonuniform concentrations of bacteria. The following table, 
taken from Bliss and Fisher (1953), summarizes the counts of clumps on 400 grid 
squares. 


Number per Square |   0    1    2    3    4    5    6    7    8    9   10   19
Frequency         |  56  104   80   62   42   27    9    9    5    3    2    1


To fit a Poisson distribution to these data, we compute the mle, $\hat\lambda$, which is the
mean of the 400 counts:

$$\hat\lambda = \frac{0 \times 56 + 1 \times 104 + 2 \times 80 + \cdots + 19 \times 1}{400} = 2.44$$

The following table shows the observed and expected counts and the components
of the chi-square test statistic. (The last several cells were grouped together so that the
minimum expected count would be nearly 5.)

Number per Square |    0      1      2      3      4      5      6     ≥7
Observed          |   56    104     80     62     42     27      9     20
Expected          | 34.9   85.1  103.8   84.4   51.5   25.1   10.2    5.0
Component of $X^2$  | 12.8    4.2    5.5    5.9    1.8    .14    .14   45.0





The chi-square statistic is $X^2 = 75.4$. With 6 degrees of freedom (there are eight
cells, and one parameter has been estimated from the data), the null hypothesis is
conclusively rejected [$\chi_6^2(.005) = 18.55$, so the p-value is less than .005]. When a
goodness-of-fit test rejects, it is instructive to find out why; where does the model fail
to fit? This can be seen by looking at the cells that make large contributions to $X^2$ and
the signs of the observed minus the expected counts for those cells. We see here that
the greatest contributions to $X^2$ come from the first and last cells of the table: there
are too many small counts and too many large counts relative to what is expected for
a Poisson distribution.
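
The fit for the bacterial clump data can be recomputed from the frequency table. This is a minimal sketch with illustrative variable names; it uses unrounded expected counts, so the total may differ slightly from the value computed from the rounded table entries.

```python
from math import exp, factorial

counts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 19]
freqs = [56, 104, 80, 62, 42, 27, 9, 9, 5, 3, 2, 1]
n = sum(freqs)

# The mle of the Poisson parameter is the mean of the 400 counts.
lam_hat = sum(k * f for k, f in zip(counts, freqs)) / n

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

# Cells 0..6 are kept separate; counts of 7 or more are pooled into one cell.
observed = [f for k, f in zip(counts, freqs) if k <= 6]
observed.append(sum(f for k, f in zip(counts, freqs) if k >= 7))

expected = [n * poisson_pmf(k, lam_hat) for k in range(7)]
expected.append(n - sum(expected))   # remaining mass, P(X >= 7)

components = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
x2 = sum(components)
df = len(observed) - 1 - 1   # cells minus estimated parameters minus 1

for o, e, c in zip(observed, expected, components):
    print(f"{o:4d} {e:7.1f} {c:6.2f}")
print(f"lambda-hat = {lam_hat:.2f}, X^2 = {x2:.1f}, df = {df}")
```

Printing the components alongside the signs of observed minus expected makes it easy to see which cells drive the lack of fit.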


Fisher's Reexamination of Mendel's Data 

In one of his famous experiments, Mendel crossed 556 smooth, yellow male peas 
with wrinkled, green female peas. According to now established genetic theory, the 
relative frequencies of the progeny should be as given in the following table. 








Type               Frequency
Smooth yellow      9/16
Smooth green       3/16
Wrinkled yellow    3/16
Wrinkled green     1/16





The counts that Mendel recorded and the expected counts are given in the following
table:

Type               Observed Count    Expected Count
Smooth yellow      315               312.75
Smooth green       108               104.25
Wrinkled yellow    102               104.25
Wrinkled green     31                34.75


Calculating the likelihood ratio test statistic, we obtain

$$-2\log\Lambda = 2 \sum_{i=1}^{4} O_i \log\left(\frac{O_i}{E_i}\right) = .618$$

Comparing this value with the chi-square distribution with 3 degrees of freedom (three
independent parameters are estimated under $\Omega$ and none under $\omega_0$), we have a p-value
of slightly less than .9. Pearson’s chi-square statistic is .604, which is quite close to 
the value from the likelihood ratio test. We interpret the p-value as meaning that, 
even if the model were correct, discrepancies this large or larger would be expected 
to occur on the basis of chance about 90% of the time. There is thus no reason 
to reject the hypothesis that the counts come from a multinomial distribution with 
the prescribed probabilities. We would tend to doubt this hypothesis for only small 
p-values. 
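
Both statistics for Mendel's data, and the p-value, can be recomputed from the table. The sketch below takes the cell probabilities to be 9/16, 3/16, 3/16, and 1/16, which is what the tabulated expected counts imply for n = 556. The upper-tail function relies on a standard closed form for the chi-square distribution with 3 degrees of freedom rather than a library routine; its name is illustrative.

```python
from math import erfc, exp, log, pi, sqrt

observed = [315, 108, 102, 31]
probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
n = sum(observed)
expected = [n * p for p in probs]

g2 = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))   # -2 log Lambda
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))     # Pearson

def chi2_3_upper_tail(t):
    """P(chi-square with 3 df >= t), using the closed form available for 3 df."""
    return erfc(sqrt(t / 2)) + sqrt(2 * t / pi) * exp(-t / 2)

p_value = chi2_3_upper_tail(g2)

print(f"-2 log Lambda = {g2:.3f}, X^2 = {x2:.3f}, p-value = {p_value:.3f}")
```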

The p-value can also be interpreted to mean that on the basis of chance we would 
expect agreement this close or closer about only 10% of the time. There is some 
validity to the suggestion that the data agree with the model too well; if the p-value 
had been .999, for example, we would definitely be suspicious. 

Fisher pooled the results of all of Mendel’s experiments in the following way. 
Suppose that two independent experiments give chi-square statistics with p and r 
degrees of freedom, respectively. Then, under the null hypothesis that the models were 
correct, the sum of those two test statistics would follow a chi-square distribution with 
p + r degrees of freedom. Fisher added the chi-square statistics for all the independent
experiments and compared the result with the chi-square distribution with degrees of 
freedom equal to the sum of all the degrees of freedom. The resulting p-value was 
.99996. Such close agreement would only occur 4 times out of 100,000 on the basis 
of chance! 

What happened? Did Mendel deliberately or unconsciously fudge the data? Did 
he have an overzealous lab technician who was hoping for a recommendation to 
medical school? Was there divine intervention? Perhaps the best explanation is that 
Mendel continued experimenting until the results looked good and then he stopped. 
The statistical analysis here assumes that the sample size is fixed before the data are
collected.


Mendel is not the only scientist whose data are too good to be true. Cyril 
Burt was an English psychologist whose work had a great impact on the debate 
concerning the genetic basis for intelligence. His many papers and extensive data 
argue for such a basis. In 1946, Burt became the first psychologist to be knighted; 
however, during the 1970s, his work came under increasing attack, and he was
accused of actually fabricating data. One of his most famous studies was of the
intelligence and occupational status of 40,000 fathers and sons. Dorfman (1978) studied
the goodness of fit of these intelligence scores to a normal distribution, using
Pearson's chi-square test. The p-values for fathers and sons were greater than $1 - 10^{-7}$
and $1 - 10^{-5}$, respectively! Dorfman concluded that “it may well be that Burt's
frequency distributions are the most normally distributed in the history of anthropometric
measurement.”


The Poisson Dispersion Test 


The likelihood ratio test and Pearson's chi-square test are carried out with respect to 
the general alternative hypothesis that the cell probabilities are completely free. If one 
has a specific alternative hypothesis in mind, better power can usually be obtained 
by testing against that alternative rather than against a more general alternative. Such 
a test is developed in this section for the hypothesis that a distribution is Poisson. The 
test is quite useful, and its construction affords another illustration of a generalized 
likelihood ratio test. 

The two key assumptions underlying the Poisson distribution are that the rate 
is constant and that the counts in one interval of time or space are independent of 
the counts in disjoint intervals. These assumptions are often not met. For exam- 
ple, suppose that insects are counted on leaves of plants. The leaves are of different 
sizes and occur at various locations on different plants; the rate of infestation may 
well not be constant over the different locations. Furthermore, if the insects hatched 
from eggs that were deposited in groups, there might be clustering of the insects 
and the independence assumption might fail. If counts occurring over time are being 
recorded, the underlying rate of the phenomenon being studied might not be con- 
stant. Motor vehicle counts for traffic studies, for example, typically vary cyclically 
over time. 

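Given counts $x_1, \ldots, x_n$, we consider testing the null hypothesis that the counts
are Poisson with the common parameter $\lambda$ versus the alternative hypothesis that
they are Poisson but have different rates, $\lambda_1, \ldots, \lambda_n$. Under $\Omega$, there are $n$
different rates; $\omega_0 \subset \Omega$ is the special case that they are all equal. Under $\omega_0$, the
maximum likelihood estimate of $\lambda$ is $\hat\lambda = \bar{X}$. Under $\Omega$, the maximum likelihood
estimates of the $\lambda_i$ are $x_1, \ldots, x_n$; we denote these estimates by $\tilde\lambda_i$. The likelihood ratio
is thus

$$\Lambda = \frac{\prod_{i=1}^{n} \hat\lambda^{x_i} e^{-\hat\lambda} / x_i!}{\prod_{i=1}^{n} \tilde\lambda_i^{x_i} e^{-\tilde\lambda_i} / x_i!} = \prod_{i=1}^{n} \left(\frac{\bar{X}}{x_i}\right)^{x_i} e^{x_i - \bar{X}}$$

The likelihood ratio test statistic is

$$-2\log\Lambda = -2 \sum_{i=1}^{n} \left[x_i \log\left(\frac{\bar{X}}{x_i}\right) + (x_i - \bar{X})\right] = 2 \sum_{i=1}^{n} x_i \log\left(\frac{x_i}{\bar{X}}\right)$$

A nearly equivalent form of this statistic is produced using the Taylor series argument
given in Section 9.5:

$$-2\log\Lambda \approx \frac{1}{\bar{X}} \sum_{i=1}^{n} (x_i - \bar{X})^2$$

Under $\Omega$, there are $n$ independent parameters, $\lambda_1, \ldots, \lambda_n$, so $\dim \Omega = n$. Under
$\omega_0$, there is only one parameter, $\lambda$, so $\dim \omega_0 = 1$, and the degrees of freedom are
$n - 1$.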

The last equation given for the test statistic may be interpreted as being the ratio of 
n times the estimated variance to the estimated mean. For the Poisson distribution, the 
variance equals the mean; for the types of alternatives discussed above, the variance 
is typically greater than the mean. For this reason the test is often called the Poisson 
dispersion test. It is sensitive to—that is, has high power against—alternatives that 
are overdispersed relative to the Poisson, such as the negative binomial distribution. 
The ratio $\hat\sigma^2/\bar{X}$ is sometimes used as a measure of clustering.


Asbestos Fibers 

In Example A in Section 8.4, we considered whether counts of asbestos fibers on grid
squares could be modeled as a Poisson distribution. Applying the Poisson dispersion
test, we find that

$$\frac{1}{\bar{X}} \sum_{i=1}^{n} (x_i - \bar{X})^2 = 26.56$$

or, if the likelihood ratio statistic is used,

$$2 \sum_{i=1}^{n} x_i \log\left(\frac{x_i}{\bar{X}}\right) = 27.0$$

Since there are 23 observations, there are 22 degrees of freedom. From a computer
calculation, the p-value for the likelihood ratio statistic is .21. The evidence against
the null hypothesis is not persuasive; however, the sample size is small and the test
may have low power.


Bacterial Clumps 
In Example B in Section 9.5, we applied Pearson’s chi-square test to test whether 
counts of bacteria clumps in milk were fit by a Poisson distribution. There we found 
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X — 2.44. The sample variance is 





a 0 x56412x104 9-19? x1 , 
O = X 
400 

— 4.59 


The ratio of the variance to the mean is 1.88 rather than 1; the test statistic is 


|. 400 x 4.59 
^... 240 


= 732.7 


Under the null hypothesis, the statistic approximately follows a chi-square distribution 
with 399 degrees of freedom. Since a chi-square random variable with m degrees of 
freedom is the sum of the squares of m independent N (0, 1) random variables, the 
central limit theorem implies that for large values of m the chi-square distribution 
with m degrees of freedom is approximately normal. For a chi-square distribution, the 
mean equals the number of degrees of freedom and variance equals twice the number 
of degrees of freedom. The p-value can thus be found by standardizing the statistic 
and using tables of the standard normal distribution: 





$$P(T > 752.7) = P\left(\frac{T - 399}{\sqrt{2 \times 399}} > \frac{752.7 - 399}{\sqrt{2 \times 399}}\right) \approx 1 - \Phi(12.5) \approx 0$$


Thus, there is almost no doubt that the Poisson distribution fails to fit the data. ■
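The normal approximation used in this example can be sketched in a few lines. The helper name below is hypothetical, and the inputs are the rounded values quoted in the text (n = 400, variance 4.59, mean 2.44), so the computed statistic differs slightly from the 752.7 reported above, presumably because the text worked with unrounded quantities.

```python
import math

def chi2_upper_tail_normal_approx(t, df):
    """Approximate P(T > t) for T ~ chi-square(df), df large, via the CLT."""
    # A chi-square variable with df degrees of freedom has mean df and variance 2*df
    z = (t - df) / math.sqrt(2.0 * df)
    # Upper tail of the standard normal: 1 - Phi(z) = erfc(z / sqrt(2)) / 2
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p

n, var_hat, mean_hat = 400, 4.59, 2.44   # rounded values quoted in the text
t = n * var_hat / mean_hat               # dispersion statistic n * sigma_hat^2 / xbar
z, p = chi2_upper_tail_normal_approx(t, n - 1)
print(t, z, p)
```

The standardized value is about 12.5, so the approximate p-value is zero for all practical purposes.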


9.7 Hanging Rootograms


In this and the next section, we develop additional informal techniques for assessing 
goodness of fit. The first of these is the hanging rootogram. Hanging rootograms are a 
graphical display of the differences between observed and fitted values in histograms. 
To illustrate the construction and interpretation of hanging rootograms, we will use 
a set of data from the field of clinical chemistry (Martin, Gudzinowicz, and Fanger 
1975). The following table gives the empirical distribution of 152 serum potassium 
levels. In clinical chemistry, distributions such as this are often tabulated to establish 
a range of “normal” values against which the level of the chemical found in a patient 
can be compared to determine whether it is abnormal. The tabulated distributions are 
often fit to parametric forms such as the normal distribution. 
Serum Potassium Levels

Interval Midpoint    Frequency
3.2                   2
3.3                   1
3.4                   3
3.5                   2
3.6                   7
3.7                   8
3.8                   8
3.9                  14
4.0                  14
4.1                  18
4.2                  16
4.3                  15
4.4                  10
4.5                   8
4.6                   8
4.7                   6
4.8                   4
4.9                   1
5.0                   1
5.1                   1
5.2                   4
5.3                   1





Figure 9.3(a) is a histogram of the frequencies. The plot looks roughly bell-shaped,
but the normal distribution is not the only bell-shaped distribution. In order to
evaluate the distribution more exactly, we must compare the observed frequencies
to frequencies fit by the normal distribution. This can be done in the following way.
Suppose that the parameters $\mu$ and $\sigma$ of the normal distribution are estimated from the
data by $\bar x$ and $\hat\sigma$. If the $j$th interval has the left boundary $x_{j-1}$ and the right boundary $x_j$,
then according to the model, the probability that an observation falls in that interval is


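$$\hat p_j = \Phi\left(\frac{x_j - \bar x}{\hat\sigma}\right) - \Phi\left(\frac{x_{j-1} - \bar x}{\hat\sigma}\right)$$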


If the sample is of size n, the predicted, or fitted, count in the jth interval is 





$$\hat n_j = n \hat p_j$$

which may be compared to the observed counts, $n_j$.
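A minimal sketch of the fitted-count calculation for the serum potassium table follows. Two things are assumptions rather than statements from the text: each interval is taken to extend 0.05 on either side of its midpoint, and the mean and standard deviation are estimated from the midpoints weighted by frequency. The function names are illustrative.

```python
import math

# Serum potassium table: interval midpoints 3.2, 3.3, ..., 5.3 and their frequencies
midpoints = [round(3.2 + 0.1 * k, 1) for k in range(22)]
freqs = [2, 1, 3, 2, 7, 8, 8, 14, 14, 18, 16, 15, 10, 8, 8, 6, 4, 1, 1, 1, 4, 1]

def normal_cdf(z):
    """Standard normal cumulative distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fitted_normal_counts(midpoints, freqs, half_width=0.05):
    """Return (mean, sd, fitted counts n * p_j) for a normal fit to binned data."""
    n = sum(freqs)
    # Assumption: estimate the parameters from midpoints weighted by frequency
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    sd = math.sqrt(sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs)) / n)
    fitted = []
    for m in midpoints:
        left, right = m - half_width, m + half_width   # assumed bin boundaries
        p_j = normal_cdf((right - mean) / sd) - normal_cdf((left - mean) / sd)
        fitted.append(n * p_j)
    return mean, sd, fitted

mean, sd, fitted = fitted_normal_counts(midpoints, freqs)
for m, obs, fit in zip(midpoints, freqs, fitted):
    print(f"{m:.1f}  observed {obs:2d}  fitted {fit:6.2f}")
```

The fitted counts sum to slightly less than 152 because the normal model places a little probability outside the tabulated range.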

Figure 9.3(b) is a hanging histogram of the differences: observed count ($n_j$)
minus expected count ($\hat n_j$). These differences are difficult to interpret since the
variability is not constant from cell to cell. If we neglect the variability in the estimated
expected counts, we have

$$\operatorname{Var}(n_j - \hat n_j) = \operatorname{Var}(n_j) = n p_j (1 - p_j) = n p_j - n p_j^2$$
FIGURE 9.3 (a) Histogram, (b) hanging histogram, (c) hanging rootogram, and
(d) hanging chi-gram for normal fit to serum potassium data. 


In this case, the $p_j$ are small, so

$$\operatorname{Var}(n_j - \hat n_j) \approx n p_j$$

Thus, cells with large values of $p_j$ (equivalent to large values of $n_j$ if the model is at
all close) have more variable differences, $n_j - \hat n_j$. In a hanging histogram, we expect
larger fluctuations in the center than in the tails. This unequal variability makes it 
difficult to assess and compare the fluctuations, since a large deviation may indicate 
real misfit of the model or may be merely caused by large random variability. 

To put the differences between observed and expected values on a scale on which
they all have equal variability, a variance-stabilizing transformation may be used.
(Such transformations will be used in later chapters as well.) Suppose that a random
variable $X$ has mean $\mu$ and variance $\sigma^2(\mu)$, which depends on $\mu$. If $Y = f(X)$, the
method of propagation of error (Section 4.6) shows that

$$\operatorname{Var}(Y) \approx \sigma^2(\mu) [f'(\mu)]^2$$

Thus if $f$ is chosen so that $\sigma^2(\mu)[f'(\mu)]^2$ is constant, the variance of $Y$ will not depend
on $\mu$. The transformation $f$ that accomplishes this is called a variance-stabilizing
transformation.

Let us apply this idea to the case we have been considering:

$$E(n_j) = n p_j = \mu$$
$$\operatorname{Var}(n_j) \approx n p_j = \sigma^2(\mu)$$

That is, $\sigma^2(\mu) = \mu$, so $f$ will be a variance-stabilizing transformation if $\mu [f'(\mu)]^2$
is constant. The function $f(x) = \sqrt{x}$ does the job, and

$$E(\sqrt{n_j}) \approx \sqrt{n p_j}$$
$$\operatorname{Var}(\sqrt{n_j}) \approx \frac{1}{4}$$

if the model is correct.
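The constant 1/4 follows directly from the propagation-of-error formula; the short derivation below spells out the step for the square root, using only the symbols already introduced.

```latex
\begin{align*}
f(x) = \sqrt{x} \quad &\Longrightarrow \quad f'(\mu) = \frac{1}{2\sqrt{\mu}}, \\
\sigma^2(\mu)\,[f'(\mu)]^2 &= \mu \cdot \frac{1}{4\mu} = \frac{1}{4}, \\
\operatorname{Var}\left(\sqrt{n_j}\right) \approx \frac{1}{4}
\quad &\Longrightarrow \quad
\operatorname{SD}\left(\sqrt{n_j}\right) \approx \frac{1}{2}.
\end{align*}
```

A standard deviation of about 1/2 on the square root scale is what makes deviations of 1.0 or 1.5 correspond to 2 or 3 standard deviations.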
Figure 9.3(c) shows a hanging rootogram, a display showing

$$\sqrt{n_j} - \sqrt{\hat n_j}$$

The advantage of the hanging rootogram is that the deviations from cell to cell have 
approximately the same statistical variability. To assess the deviations, we may use the 
rough rule of thumb that a deviation of more than 2 or 3 standard deviations (more than 
1.0 or 1.5 in this case) is “large.” The most striking feature of the hanging rootogram in 
Figure 9.3(c) is the large deviation in the right tail. Generally, deviations in the center 
have been down-weighted and those in the tails emphasized by the transformation. 
Also, it is noteworthy that although the deviations other than the one in the right 
tail are not especially large, they have a certain systematic character: Note the run 
of positive deviations followed by the run of negative deviations and then the large 
positive deviation in the extreme right tail. This may indicate some asymmetry in the 
distribution. 

A possible alternative to the rootogram is what can be called a hanging chi-gram,
a plot of the components of Pearson's chi-square statistic:

$$\frac{n_j - \hat n_j}{\sqrt{\hat n_j}}$$

Neglecting the variability in the expected counts, as before, $\operatorname{Var}(n_j - \hat n_j) \approx n p_j = \hat n_j$,
so

$$\operatorname{Var}\left(\frac{n_j - \hat n_j}{\sqrt{\hat n_j}}\right) \approx 1$$


so this technique also stabilizes variance. Figure 9.3(d) is a hanging chi-gram for the 
case we have been considering; it is quite similar in overall character to the hanging 
rootogram, but the deviation in the right tail is emphasized even more. 
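The two displays differ only in how each cell's deviation is scaled, which a small sketch makes concrete. The function name and the three observed and fitted counts are hypothetical, picked so that the square roots are exact; the threshold of 1.0 is the "2 standard deviations" rule of thumb on the rootogram scale.

```python
import math

def hanging_deviations(observed, fitted):
    """Return (rootogram deviations, chi-gram deviations) cell by cell."""
    # Rootogram: sqrt(n_j) - sqrt(n_hat_j), standard deviation roughly 1/2
    roots = [math.sqrt(o) - math.sqrt(f) for o, f in zip(observed, fitted)]
    # Chi-gram: (n_j - n_hat_j) / sqrt(n_hat_j), standard deviation roughly 1
    chis = [(o - f) / math.sqrt(f) for o, f in zip(observed, fitted)]
    return roots, chis

# Hypothetical observed and fitted counts, for illustration only
observed = [4, 9, 16]
fitted = [4.0, 16.0, 9.0]
roots, chis = hanging_deviations(observed, fitted)

# Flag cells whose rootogram deviation is at least 2 standard deviations (2 * 1/2 = 1.0)
large = [j for j, r in enumerate(roots) if abs(r) >= 1.0]
print(roots, chis, large)
```

In practice the observed and fitted counts would come from the table and the normal fit rather than from made-up numbers.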





9.8 Probability Plots


Probability plots are an extremely useful graphical tool for qualitatively assessing the 
fit of data to a theoretical distribution. Consider a sample of size n from a uniform 
distribution on [0, 1]. Denote the ordered sample values by $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$.
These values are called order statistics. It can be shown (see Problem 17 at the end
of Chapter 4) that

$$E(X_{(j)}) = \frac{j}{n+1}$$

This suggests plotting the ordered observations $X_{(1)}, \ldots, X_{(n)}$ against their expected
values $1/(n+1), \ldots, n/(n+1)$. If the underlying distribution is uniform, the plot
should look roughly linear. Figure 9.4 is such a plot for a sample of size 100 from a
uniform distribution.

FIGURE 9.4 Uniform-uniform probability plot.

Now suppose that a sample $Y_1, \ldots, Y_{100}$ is generated in which each $Y$ is half the
sum of two independent uniform random variables. The distribution of $Y$ is no longer
uniform but triangular:

$$f(y) = \begin{cases} 4y, & 0 \le y \le \frac{1}{2} \\ 4 - 4y, & \frac{1}{2} \le y \le 1 \end{cases}$$

The ordered observations $Y_{(1)}, \ldots, Y_{(n)}$ are plotted against the points $1/(n+1), \ldots,$
$n/(n+1)$. The graph in Figure 9.5 shows a clear deviation from linearity and enables
us to describe qualitatively the deviation of the distribution of the Y's from the uniform 
distribution. Note that in the left tail of the plotted distribution (near 0), the order statis- 
tics are larger than expected for a uniform distribution, and in the right tail (near 1), 
they are smaller, indicating that the tails of the distribution of the Y's decrease more 
quickly (are "lighter") than the tails of the uniform distribution. 
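The coordinates of such a plot are simple to generate. The sketch below simulates 100 values, each half the sum of two independent uniforms, and pairs the ordered values with $k/(n+1)$; the function name and the seed are arbitrary choices, and the plotting itself is left to whatever graphics tool is at hand.

```python
import random

def uniform_plot_points(sample):
    """Pair each ordered observation with its expected uniform order statistic k/(n+1)."""
    n = len(sample)
    ordered = sorted(sample)
    return [(k / (n + 1), y) for k, y in enumerate(ordered, start=1)]

rng = random.Random(0)  # arbitrary seed so the run is reproducible
# Each Y is half the sum of two independent uniforms, hence triangular on [0, 1]
triangular = [(rng.random() + rng.random()) / 2 for _ in range(100)]
points = uniform_plot_points(triangular)

# Inspect the extremes: expected uniform value versus observed order statistic
for expected, observed in points[:3] + points[-3:]:
    print(f"{expected:.3f}  {observed:.3f}")
```

For a triangular sample one would typically see the smallest observations above their expected uniform values and the largest below them, which is the pattern described for Figure 9.5.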

The technique can be extended to other continuous probability laws by means of 
Proposition C of Section 2.3, which states that if X is a continuous random variable 
with a strictly increasing cumulative distribution function, $F_X$, and if $Y = F_X(X)$,
then $Y$ has a uniform distribution on [0, 1]. The transformation $Y = F_X(X)$ is known
as the probability integral transform. 

The following procedure is suggested by the proposition just referred to. Suppose
that it is hypothesized that $X$ follows a certain distribution, $F$. Given a sample
$X_1, \ldots, X_n$, we plot

$$F(X_{(k)}) \quad \text{vs.} \quad \frac{k}{n+1}$$

or equivalently

$$X_{(k)} \quad \text{vs.} \quad F^{-1}\left(\frac{k}{n+1}\right)$$

FIGURE 9.5 Uniform-triangular probability plot.


In some cases, $F$ is of the form

$$F(x) = G\left(\frac{x - \mu}{\sigma}\right)$$

where $\mu$ and $\sigma$ are called location and scale parameters, respectively. The normal
distribution is of this form. We could plot

$$\frac{X_{(k)} - \mu}{\sigma} \quad \text{vs.} \quad G^{-1}\left(\frac{k}{n+1}\right)$$

or if we plotted

$$X_{(k)} \quad \text{vs.} \quad G^{-1}\left(\frac{k}{n+1}\right)$$

the result would be approximately a straight line if the model were correct:

$$X_{(k)} \approx \sigma G^{-1}\left(\frac{k}{n+1}\right) + \mu$$

Slight modifications of this procedure are sometimes used. For example, rather than
$G^{-1}[k/(n+1)]$, $E(X_{(k)})$, the expected value of the $k$th smallest observation, can be
used. But it can be argued that

$$E(X_{(k)}) \approx F^{-1}\left(\frac{k}{n+1}\right) = \sigma G^{-1}\left(\frac{k}{n+1}\right) + \mu$$

so this modification yields very similar results to the original procedure.

The procedure can be viewed from another perspective. Recall from Section 2.2 
that $F^{-1}[k/(n+1)]$ is the $k/(n+1)$ quantile of the distribution $F$; that is, it is the
point such that the probability that a random variable with distribution function F is 
less than it is k/(n + 1). We are thus plotting the ordered observations (which may be 
viewed as the observed or empirical quantiles) versus the quantiles of the theoretical 
distribution. 
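Viewed this way, a normal probability plot needs only the sorted data and the standard normal quantiles at $k/(n+1)$. The following sketch uses Python's standard-library `statistics.NormalDist` for the quantile function; the function name and the five sample values are hypothetical.

```python
from statistics import NormalDist

def normal_plot_points(sample):
    """Pair the k/(n+1) standard normal quantile with the kth ordered observation."""
    n = len(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf(k / (n + 1)), x)
            for k, x in enumerate(sorted(sample), start=1)]

# Hypothetical measurements, for illustration only
sample = [12.1, 9.8, 10.4, 11.3, 8.9]
points = normal_plot_points(sample)
for q, x in points:
    print(f"{q:7.3f}  {x}")
```

If the data were normal with mean $\mu$ and standard deviation $\sigma$, the points would fall near a line with intercept $\mu$ and slope $\sigma$.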


We can illustrate the procedure just described using a set of 100 observations, which 
are Michelson’s determinations of the velocity of light made from June 5, 1879 to 
July 2, 1879; 299,000 has been subtracted from the determinations to give the values 
listed [data from Stigler (1977)]: 


850 960 880 890 890 740 
940 880 810 840 900 960 
880 810 780 1070 940 860 
820 810 930 880 720 800 
760 850 800 720 770 810
950 850 620 760 790 980 
880 860 740 810 980 900 
970 750 820 880 840 950 
760 850 1000 830 880 910 
870 980 790 910 920 870 
930 810 850 890 810 650 
880 870 860 740 760 880 
840 880 810 810 830 840 
720 940 1000 800 850 840 
950 1000 790 840 850 800 
960 760 840 850 810 960 
800 840 780 870 


Figure 9.6 shows the normal probability plot. The plot looks straight, showing that 
the normal distribution gives a reasonable fit. 

A word of caution is in order here: Probability plots are by nature monotone- 
increasing and they all tend to look fairly straight. Some experience is necessary 
in gauging “straightness.” Simulations, which are easily done, are very helpful in 
sharpening one’s judgment. Some find it useful to hold the plot so that they are 
looking down the plotted line as if it were a roadway; this often makes curvature 
much more apparent. ■

FIGURE 9.6 Normal probability plot of Michelson's data.


In order to be able to interpret probability plots, it is useful to see how they are shaped
for samples from nonnormal distributions. Figure 9.7 is a normal probability plot of
500 pseudorandom variables from a double exponential distribution:

$$f(x) = \frac{1}{2} e^{-|x|}, \qquad -\infty < x < \infty$$

This density is symmetric about zero, but its tails die off at the rate $\exp(-|x|)$. This
rate is slower than that for the tails of the normal distribution, which decay at the rate
$\exp(-x^2)$. Note how the plot in Figure 9.7 bends down at the left and up at the right,
indicating that the observations in the left tail were more negative than expected for a
normal distribution and the observations in the right tail were more positive. In other
words, the extreme observations were larger in magnitude than extreme observations
from a normal distribution would be. This effect results because the tails of the double
exponential are "heavier" than those of a normal distribution.

FIGURE 9.7 Normal probability plot of 500 pseudorandom variables from a double
exponential distribution.

Figure 9.8 is a normal probability plot of 500 pseudorandom numbers from a
gamma distribution with the shape parameter $\alpha = 5$ and the scale parameter $\lambda = 1$.
As can be seen in Figure 2.11, the gamma density with $\alpha = 5$ is nonsymmetric,
or skewed, and this is reflected by the bowlike appearance of the probability
plot. ■

FIGURE 9.8 Normal probability plot of 500 pseudorandom variables from a
gamma distribution with the shape parameter $\alpha = 5$.


As an example for a nonnormal distribution, Figure 9.9 is a gamma probability plot 
of the precipitation amounts of Example C in Section 8.4. 

The parameter $\lambda$ of a gamma distribution is a scale parameter and so, as we have
seen before, affects only the slope of a probability plot, not its straightness. Thus in
constructing a probability plot we can take $\lambda = 1$ without loss of generality. A computer was used
to find the quantiles of a gamma distribution with parameter $\alpha = .471$ and $\lambda = 1$,
and Figure 9.9 was produced by plotting the observed sorted values of precipitation
versus the quantiles. Qualitatively, the fit looks reasonable, because there is no gross
systematic deviation from a straight line. ■

FIGURE 9.9 Gamma probability plot of rainfall distribution.


Probability plots can also be constructed for grouped data, such as the data on
serum potassium levels in Section 9.7. Because the ordered observations are not all
available in such a case, the procedure must be modified. Suppose that the grouping
gives the points $x_1, x_2, \ldots, x_{m+1}$ for the histogram's bin boundaries and that in the
interval $[x_i, x_{i+1})$ there are $n_i$ counts, where $i = 1, \ldots, m$. We denote the cumulative
frequencies by $N_j = \sum_{i=1}^{j} n_i$. Then $N_1 \le N_2 \le \cdots \le N_m$ and $N_m = n$, which is the
total sample size. We thus plot

$$x_{j+1} \quad \text{vs.} \quad G^{-1}\left(\frac{N_j}{n+1}\right), \qquad j = 1, \ldots, m$$


Figure 9.10 shows a normal probability plot for the serum potassium data of Section 
9.7. The cumulative frequencies are found by summing the frequencies in each bin. 
The deviations in the right tail are immediately apparent. ■
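The grouped-data construction can be sketched for the serum potassium table as follows. The bin boundaries are an assumption (each midpoint plus or minus 0.05, giving 3.15, 3.25, ..., 5.35), $G$ is taken to be the standard normal distribution function, and the function name is illustrative.

```python
from statistics import NormalDist

def grouped_normal_plot_points(boundaries, counts):
    """Pair the N_j/(n+1) normal quantile with the right boundary x_{j+1} of bin j."""
    n = sum(counts)
    std_normal = NormalDist()
    points = []
    cumulative = 0
    for j, count in enumerate(counts):
        cumulative += count                      # cumulative frequency N_j
        quantile = std_normal.inv_cdf(cumulative / (n + 1))
        points.append((quantile, boundaries[j + 1]))
    return points

# Assumed boundaries: midpoints 3.2, ..., 5.3 with half-width 0.05
boundaries = [round(3.15 + 0.1 * k, 2) for k in range(23)]
freqs = [2, 1, 3, 2, 7, 8, 8, 14, 14, 18, 16, 15, 10, 8, 8, 6, 4, 1, 1, 1, 4, 1]
points = grouped_normal_plot_points(boundaries, freqs)
for q, x in points:
    print(f"{q:7.3f}  {x:.2f}")
```

Dividing by $n + 1$ rather than $n$ keeps the last plotting position below 1, so its normal quantile is finite.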


9.9 Tests for Normality


A wide variety of tests are available for testing goodness of fit to the normal distribu- 
tion. We discuss some of them in this section; more discussion may be found in the 
works referred to. 

FIGURE 9.10 Normal probability plot of serum potassium data.


If the data are grouped into bins, with several counts in each bin, Pearson’s chi- 
square test for goodness of fit may be applied. But if the parameters are estimated 
from ungrouped data and the expected counts in each bin are calculated using the 
estimated parameters, the limiting distribution of the test statistic is no longer chi- 
square. In order for the limiting distribution to be chi-square, the parameters must 
be estimated from the grouped data. This was pointed out by Chernoff and Lehmann 
(1954) and is further discussed by Dahiya and Gurland (1972). Generally speaking, 
it seems rather artificial and wasteful of information to group continuous data. 

Departures from normality often take the form of asymmetry, or skewness. For a
normal distribution, the third central moment is $\int_{-\infty}^{\infty} (x - \mu)^3 \varphi(x)\,dx$, which equals 0
since the density is symmetric about $\mu$. Suppose that we wish to test the null hypothesis
that $X_1, \ldots, X_n$ are independent normally distributed random variables with the same
mean and variance. A goodness-of-fit test can be based on the coefficient of skewness
for the sample,

$$b_1 = \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^3}{s^3}$$

The test rejects for large values of $|b_1|$.

Symmetric distributions can depart from normality by being heavy-tailed or light- 
tailed or too peaked or too flat in the center. These forms of departures may be detected 
by the coefficient of kurtosis for the sample, 


$$b_2 = \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^4}{s^4}$$

If either of these measures is to be used as a test statistic, its sampling distribution
when the distribution generating the data is normal must be determined. The hypoth- 
esis test rejects normality when the observed value of the statistic is in the tails of the 
sampling distribution. These sampling distributions are difficult to evaluate in closed 
form but can be approximated by simulations. 
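One way such a simulation might be organized is sketched below. It computes the sample skewness and kurtosis coefficients and approximates the null distribution of $|b_1|$ by repeatedly drawing normal samples. The function names, the seed, and the number of replications are arbitrary, and taking $s^2$ with divisor $n - 1$ is an assumption; with a divisor of $n$ the values change slightly.

```python
import random

def skewness_kurtosis(xs):
    """Return the sample coefficients (b1, b2) of skewness and kurtosis."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)   # assumption: divisor n - 1
    m3 = sum((x - xbar) ** 3 for x in xs) / n
    m4 = sum((x - xbar) ** 4 for x in xs) / n
    return m3 / s2 ** 1.5, m4 / s2 ** 2

def simulate_null_abs_skewness(n, reps, seed=0):
    """Simulate |b1| for `reps` samples of size n from a standard normal."""
    rng = random.Random(seed)
    return [abs(skewness_kurtosis([rng.gauss(0.0, 1.0) for _ in range(n)])[0])
            for _ in range(reps)]

def simulated_p_value(observed_abs_b1, null_values):
    """Fraction of simulated |b1| values at least as large as the observed one."""
    return sum(v >= observed_abs_b1 for v in null_values) / len(null_values)

b1, b2 = skewness_kurtosis([1, 2, 3, 4, 5])
null_values = simulate_null_abs_skewness(n=20, reps=200)
print(b1, b2, simulated_p_value(0.8, null_values))
```

Because $b_1$ and $b_2$ are unchanged by shifts and rescalings of the data, simulating from a standard normal suffices for any normal null hypothesis.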

A goodness-of-fit test may also be based on the linearity of the probability 
plot, as measured by the correlation coefficient, r, of the x and y coordinates of 
the points of the probability plot. The test rejects for small values of r. The sam- 
pling distribution of r under normality has been approximated by simulations and 
is tabled in Filliben (1975). Ryan and Joiner (unpublished) give a short table for 
the null sampling distribution of r from normal probability plots with critical val- 
ues for the correlation coefficient corresponding to significance levels .1, .05, 
and .01: 








n     .1      .05     .01
4     .8951   .8734   .8318
5     .9033   .8804   .8320
10    .9347   .9180   .8804
15    .9506   .9383   .9110
20    .9600   .9503   .9290
25    .9662   .9582   .9408
30    .9707   .9639   .9490
40    .9767   .9715   .9597
50    .9807   .9764   .9664
60    .9836   .9799   .9710
75    .9865   .9835   .9752





They also report the results of some simulations of the power of r as the test statistic for 
certain alternative distributions. For example, the power against a uniform distribution 
with a significance level of .1 is .13 for n = 10 and .20 for n = 20. This is somewhat 
discouraging—the test rejects only 13% of the time and 20% of the time for the given 
sample sizes if the real underlying distribution is uniform. The moral is that it may be 
quite difficult to detect departure from normality in small samples. On a more positive 
note, the power of r against an exponential distribution is 53% for n = 10 and 89%
for n = 20.
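The correlation test can be sketched as follows. The function name and the ten data values are hypothetical, and the plotting positions $k/(n+1)$ are used for simplicity; the tabled critical values were derived for the plotting positions of the cited authors, which may differ, so a comparison with the table is only a rough guide here.

```python
import math
from statistics import NormalDist

def probability_plot_correlation(sample):
    """Correlation between ordered data and standard normal quantiles at k/(n+1)."""
    n = len(sample)
    std_normal = NormalDist()
    q = [std_normal.inv_cdf(k / (n + 1)) for k in range(1, n + 1)]
    x = sorted(sample)
    qbar, xbar = sum(q) / n, sum(x) / n
    sxy = sum((a - qbar) * (b - xbar) for a, b in zip(q, x))
    sxx = sum((a - qbar) ** 2 for a in q)
    syy = sum((b - xbar) ** 2 for b in x)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sample of size 10, for illustration only
sample = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.6, 4.9, 5.0]
r = probability_plot_correlation(sample)
critical_value = 0.9180   # tabled value for n = 10 at significance level .05
reject = r < critical_value   # the test rejects for small values of r
print(r, reject)
```

Since probability plots are monotone by construction, $r$ is close to 1 even for many nonnormal samples, which is why the critical values sit so near 1.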

Pearson, D'Agostino, and Bowman (1977) report the results of quite extensive 
simulations of the power for several alternative distributions and give further refer- 
ences. 

For Michelson's data (see Example A in Section 9.8), the correlation coefficient 
is .995. From the tables in Filliben (1975), this falls between the 50th and 75th 
percentiles of the null sampling distribution, giving no reason to reject the hypothesis 
of normality. It may not be very realistic, however, to model the 100 observations 
of the velocity of light as a sample of 100 independent random variables from some 
probability distribution and to use this model to test goodness of fit. We have little 
information about how these data were collected and processed. For example, since 
the observations were made sequentially, it is quite possible that the measurement 
process drifted in time or that successive errors were correlated. It is also possible
that Michelson discarded some obviously bad data. 


9.10 Concluding Remarks


Two very important concepts, estimation and hypothesis testing, have been introduced 
in this chapter and the last. They have been introduced here in the context of fitting 
probability distributions but will recur throughout the rest of this book in various 
other contexts. Generally, observations are taken from a probability law that depends 
on a parameter, $\theta$. Estimation theory is concerned with estimating $\theta$ from the data;
the theory of hypothesis testing is concerned with testing hypotheses about the value
of $\theta$. Methods based on likelihood, maximum likelihood estimation and likelihood
ratio tests, have also been introduced. These methods are much more generally useful 
than has been demonstrated by the specific purposes to which they have been put in 
these chapters. The likelihood and the likelihood ratio are key concepts of statistics, 
from both Bayesian and frequentist perspectives. 

The fundamental concepts and techniques of hypothesis testing have been intro- 
duced in this chapter. We have seen how to test a null hypothesis by choosing a test 
statistic and a rejection region such that, under the null hypothesis, the probability 
that the test statistic falls in the rejection region is $\alpha$, the significance level of the
test. The choice of this region is determined by knowing, at least approximately, the 
null distribution of the test statistic. The test statistic is frequently, but not always, a 
likelihood ratio; when the exact distribution of the likelihood ratio cannot be found, 
we can use the chi-square distribution as a large-sample approximation. We have also 
explored the relation of the p-value of the test statistic to the significance level. In 
some situations, the p-value is a less rigid summary of the evidence than is a decision 
whether to reject the null hypothesis. 

With the increasing availability of flexible computer programs and inexpensive 
computers, graphical methods are being used more and more in statistics. The last part 
of this chapter introduced two graphical techniques: hanging rootograms and probabil- 
ity plots. Other graphical techniques will be introduced in Chapter 10. Such informal 
techniques are often of more practical use than are more formal techniques, such as 
hypothesis testing. Literally testing for goodness of fit is often rather artificial: a
parametric distribution is usually entertained only as a model for the distribution of
data values, and it is clear a priori that the data do not really come from that
distribution. If enough data were available, the goodness-of-fit test would certainly reject.
Rather than test a hypothesis that no one literally believes could hold, it is usually more
useful to ascertain qualitatively where the model fits and where and how it fails to fit.

Some concepts introduced in Chapters 7 and 8 have been elaborated on in this 
chapter. In Chapter 7, we introduced confidence intervals for the parameters of fi- 
nite populations; in Chapter 8, we considered confidence intervals for parameters of 
probability distributions. In this chapter, we have introduced hypothesis testing and 
developed a relation between hypothesis tests and confidence intervals. The method of 
propagation of error, used in Chapter 7 as a tool for analyzing the statistical behavior 
of ratio estimates, has been used in this chapter in connection with variance-stabilizing 
transformations. 
9.11 Problems


1. A coin is thrown independently 10 times to test the hypothesis that the probability
of heads is 1/2 versus the alternative that the probability is not 1/2. The test rejects
if either 0 or 10 heads are observed.

a. What is the significance level of the test? 


b. If in fact the probability of heads is .1, what is the power of the test? 


2. Which of the following hypotheses are simple, and which are composite? 


a. X follows a uniform distribution on [0, 1]. 

b. A die is unbiased. 

c. X follows a normal distribution with mean 0 and variance $\sigma^2 > 10$.
d. X follows a normal distribution with mean $\mu = 0$.


3. Suppose that $X \sim \text{bin}(100, p)$. Consider the test that rejects $H_0: p = .5$ in favor
of $H_A: p \ne .5$ for $|X - 50| > 10$. Use the normal approximation to the binomial
distribution to answer the following:

a. What is $\alpha$?
b. Graph the power as a function of p. 


4. Let X have one of the following distributions:

X      H_0    H_A
x_1    .2     .1
x_2    .3     .4
x_3    .3     .1
x_4    .2     .4

a. Compare the likelihood ratio, $\Lambda$, for each possible value of X and order the $x_i$
according to $\Lambda$.

b. What is the likelihood ratio test of $H_0$ versus $H_A$ at level $\alpha = .2$? What is the
test at level $\alpha = .5$?

c. If the prior probabilities are $P(H_0) = P(H_A)$, which outcomes favor $H_0$?

d. What prior probabilities correspond to the decision rules with $\alpha = .2$ and
$\alpha = .5$?


5. True or false, and state why: 


a. The significance level of a statistical test is equal to the probability that the 
null hypothesis is true. 

b. If the significance level of a test is decreased, the power would be expected to 
increase. 

c. If a test is rejected at the significance level $\alpha$, the probability that the null
hypothesis is true equals $\alpha$.

d. The probability that the null hypothesis is falsely rejected is equal to the power
of the test.

e. A type I error occurs when the test statistic falls in the rejection region of the
test.

f. A type II error is more serious than a type I error.
g. The power of a test is determined by the null distribution of the test statistic.
h. The likelihood ratio is a random variable.


6. Consider the coin tossing example of Section 9.1. Suppose that instead of tossing
the coin 10 times, the coin was tossed until a head came up and the total number
of tosses, X, was recorded.

a. If the prior probabilities are equal, which outcomes favor $H_0$ and which favor
$H_1$?

b. Suppose $P(H_0)/P(H_1) = 10$. What outcomes favor $H_0$?

c. What is the significance level of a test that rejects $H_0$ if $X > 8$?

d. What is the power of this test?

7. Let $X_1, \ldots, X_n$ be a sample from a Poisson distribution. Find the likelihood ratio
for testing $H_0: \lambda = \lambda_0$ versus $H_A: \lambda = \lambda_1$, where $\lambda_1 > \lambda_0$. Use the fact that the
sum of independent Poisson random variables follows a Poisson distribution to
explain how to determine a rejection region for a test at level $\alpha$.

8. Show that the test of Problem 7 is uniformly most powerful for testing $H_0: \lambda = \lambda_0$
versus $H_A: \lambda > \lambda_0$.

9. Let $X_1, \ldots, X_{25}$ be a sample from a normal distribution having a variance of
100. Find the rejection region for a test at level $\alpha = .10$ of $H_0: \mu = 0$ versus
$H_A: \mu = 1.5$. What is the power of the test? Repeat for $\alpha = .01$.


10. Suppose that $X_1, \ldots, X_n$ form a random sample from a density function, $f(x|\theta)$,
for which $T$ is a sufficient statistic for $\theta$. Show that the likelihood ratio test of
$H_0: \theta = \theta_0$ versus $H_A: \theta = \theta_1$ is a function of $T$. Explain how, if the distribution
of $T$ is known under $H_0$, the rejection region of the test may be chosen so that
the test has the level $\alpha$.

11. Suppose that $X_1, \ldots, X_{25}$ form a random sample from a normal distribution
having a variance of 100. Graph the power of the likelihood ratio test of $H_0: \mu = 0$
versus $H_A: \mu \ne 0$ as a function of $\mu$, at significance levels .10 and .05. Do the
same for a sample size of 100. Compare the graphs and explain what
you see.

12. Let $X_1, \ldots, X_n$ be a random sample from an exponential distribution with the
density function $f(x|\theta) = \theta \exp[-\theta x]$. Derive a likelihood ratio test of $H_0: \theta = \theta_0$
versus $H_A: \theta \ne \theta_0$, and show that the rejection region is of the form
$\{\bar X \exp[-\theta_0 \bar X] \le c\}$.

13. Suppose, to be specific, that in Problem 12, $\theta_0 = 1$, $n = 10$, and that $\alpha = .05$. In
order to use the test, we must find the appropriate value of $c$.

a. Show that the rejection region is of the form $\{\bar X \le x_0\} \cup \{\bar X \ge x_1\}$, where $x_0$
and $x_1$ are determined by $c$.

b. Explain why $c$ should be chosen so that $P(\bar X \exp(-\bar X) \le c) = .05$ when
$\theta_0 = 1$.

c. Explain why $\sum_{i=1}^{10} X_i$ and hence $\bar X$ follow gamma distributions when $\theta_0 = 1$.
How could this knowledge be used to choose $c$?
d. Suppose that you hadn't thought of the preceding fact. Explain how you could
determine a good approximation to $c$ by generating random numbers on a
computer (simulation).

14. Suppose that under $H_0$, a measurement X is $N(0, \sigma^2)$, and that under $H_1$, X is
$N(1, \sigma^2)$ and that the prior probability $P(H_0) = 2 \times P(H_1)$. As in Section 9.1, the
hypothesis $H_0$ will be chosen if $P(H_0|x) > P(H_1|x)$. For $\sigma^2 = 0.1, 0.5, 1.0, 5.0$:

a. For what values of X will $H_0$ be chosen?
b. In the long run, what proportion of the time will $H_0$ be chosen if $H_0$ is true 2/3
of the time?

15. Suppose that under $H_0$, a measurement X is $N(0, \sigma^2)$, and that under $H_1$, X
is $N(1, \sigma^2)$ and that the prior probability $P(H_0) = P(H_1)$. For $\sigma = 1$ and
$x \in [0, 3]$, plot and compare (1) the p-value for the test of $H_0$ and (2) $P(H_0|x)$.
Can the p-value be interpreted as the probability that $H_0$ is true? Choose another
value of $\sigma$ and repeat.


16. In the previous problem, with $\sigma = 1$, what is the probability that the p-value is
less than 0.05 if $H_0$ is true? What is the probability if $H_1$ is true?

17. Let $X \sim N(0, \sigma^2)$, and consider testing $H_0: \sigma = \sigma_0$ versus $H_A: \sigma = \sigma_1$, where
$\sigma_1 > \sigma_0$. The values $\sigma_0$ and $\sigma_1$ are fixed.

a. What is the likelihood ratio as a function of x? What values favor $H_0$? What
is the rejection region of a level $\alpha$ test?

b. For a sample, $X_1, X_2, \ldots, X_n$ distributed as above, repeat the previous
question.

c. Is the test in the previous question uniformly most powerful for testing
$H_0: \sigma = \sigma_0$ versus $H_1: \sigma > \sigma_0$?

18. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables from a double exponential
distribution with density $f(x) = \frac{1}{2} \lambda \exp(-\lambda |x|)$. Derive a likelihood ratio test of the
hypothesis $H_0: \lambda = \lambda_0$ versus $H_1: \lambda = \lambda_1$, where $\lambda_0$ and $\lambda_1 > \lambda_0$ are specified
numbers. Is the test uniformly most powerful against the alternative $H_1: \lambda > \lambda_0$?


19. Under $H_0$, a random variable has the cumulative distribution function $F_0(x) = x^2$,
$0 \le x \le 1$; and under $H_1$, it has the cumulative distribution function $F_1(x) = x^3$,
$0 \le x \le 1$.

a. If the two hypotheses have equal prior probability, for what values of x is the
posterior probability of $H_0$ greater than that of $H_1$?

b. What is the form of the likelihood ratio test of $H_0$ versus $H_1$?

c. What is the rejection region of a level $\alpha$ test?

d. What is the power of the test?


20. Consider two probability density functions on [0, 1]: $f_0(x) = 1$, and $f_1(x) = 2x$.
Among all tests of the null hypothesis $H_0: X \sim f_0(x)$ versus the alternative $X \sim
f_1(x)$, with significance level $\alpha = 0.10$, how large can the power possibly be?

21. Suppose that a single observation X is taken from a uniform density on $[0, \theta]$,
and consider testing $H_0: \theta = 1$ versus $H_1: \theta = 2$.


a. Find a test that has significance level $\alpha = 0$. What is its power?

b. For $0 < \alpha < 1$, consider the test that rejects when $X \in [0, \alpha]$. What is its
significance level and power?

c. What is the significance level and power of the test that rejects when $X \in
[1 - \alpha, 1]$?

d. Find another test that has the same significance level and power as the previous
one.

e. Does the likelihood ratio test determine a unique rejection region?

f. What happens if the null and alternative hypotheses are interchanged, that is,
$H_0: \theta = 2$ versus $H_1: \theta = 1$?

22. In Example A of Section 8.5.3 a confidence interval for the variance of a normal
distribution was derived. Use Theorem B of Section 9.3 to derive an acceptance
region for testing the hypothesis $H_0: \sigma^2 = \sigma_0^2$ at the significance level $\alpha$ based on
a sample $X_1, X_2, \ldots, X_n$. Precisely describe the rejection region if $\sigma_0 = 1$, $n = 15$,
$\alpha = .05$.

23. Suppose that a 99% confidence interval for the mean $\mu$ of a normal distribution
is found to be $(-2.0, 3.0)$. Would a test of $H_0: \mu = -3$ versus $H_A: \mu \ne -3$ be
rejected at the .01 significance level?


Let X be a binomial random variable with n trials and probability p of success. 


a. What is the generalized likelihood ratio for testing Ho: p = .5 versus 
Ha: p xz .5? 

b. Show that the test rejects for large values of | X — n/2]. 

c. Using the null distribution of X, show how the significance level corresponding 
to a rejection region | X — n/2| > k can be determined. 


d. If 1 = 10 and k = 2, what is the significance level of the test? 


e. Use the normal approximation to the binomial distribution to find the signifi- 
cance level if n = 100 and k = 10. 


This analysis is the basis of the sign test, a typical application of which would be 
something like this: An experimental drug is to be evaluated on laboratory rats. 
In n pairs of litter mates, one animal is given the drug and the other is given a 
placebo. A physiological measure of benefit is made after some time has passed. 
Let X be the number of pairs for which the animal receiving the drug benefited 
more than its litter mate. A simple model for the distribution of X if there is no 
drug effect is binomial with $p = .5$. This is then the null hypothesis that must
be made untenable by the data before one could conclude that the drug had an 
effect. 


Calculate the likelihood ratio for Example B of Section 9.5 and compare the 
results of a test based on the likelihood ratio to those of one based on Pearson's 
chi-square statistic. 


True or false:

a. The generalized likelihood ratio statistic $\Lambda$ is always less than or equal to 1.

b. If the p-value is .03, the corresponding test will reject at the significance
level .02.

c. If a test rejects at significance level .06, then the p-value is less than or equal
to .06.

d. The p-value of a test is the probability that the null hypothesis is correct.

e. In testing a simple versus simple hypothesis via the likelihood ratio, the
p-value equals the likelihood ratio.

f. If a chi-square test statistic with 4 degrees of freedom has a value of 8.5, the
p-value is less than .05.


27. What values of a chi-square test statistic with 7 degrees of freedom yield a p-value
less than or equal to .10?


28. Suppose that a test statistic $T$ has a standard normal null distribution.

a. If the test rejects for large values of $|T|$, what is the p-value corresponding to
$T = 1.50$?

b. Answer the same question if the test rejects for large $T$.


29. Suppose that a level $\alpha$ test based on a test statistic $T$ rejects if $T > t_0$. Suppose
that $g$ is a monotone-increasing function and let $S = g(T)$. Is the test that rejects
if $S > g(t_0)$ a level $\alpha$ test?


30. Suppose that the null hypothesis is true, that the distribution of the test statistic,
$T$ say, is continuous with cdf $F$ and that the test rejects for large values of $T$.
Let $V$ denote the p-value of the test.

a. Show that $V = 1 - F(T)$.

b. Conclude that the null distribution of $V$ is uniform. (Hint: See Proposition C
of Section 2.3.)

c. If the null hypothesis is true, what is the probability that the p-value is greater
than .1?

d. Show that the test that rejects if $V < \alpha$ has significance level $\alpha$.


31. What values of the generalized likelihood ratio $\Lambda$ are necessary to reject the null
hypothesis at the significance level $\alpha = .1$ if the degrees of freedom are 1, 5, 10,
and 20?


32. The intensity of light reflected by an object is measured. Suppose there are two
types of possible objects, A and B. If the object is of type A, the measurement is
normally distributed with mean 100 and standard deviation 25; if it is of type B,
the measurement is normally distributed with mean 125 and standard deviation
25. A single measurement is taken with the value $X = 120$.

a. What is the likelihood ratio?

b. If the prior probabilities of A and B are equal ($\frac{1}{2}$ each), what is the posterior
probability that the item is of type B?

c. Suppose that a decision rule has been formulated that declares the object to be
of type B if $X > 125$. What is the significance level associated with this rule?

d. What is the power of this test?

e. What is the p-value when $X = 120$?


33. It has been suggested that dying people may be able to postpone their death until
after an important occasion, such as a wedding or birthday. Phillips and King
(1988) studied the patterns of death surrounding Passover, an important Jewish
holiday, in California during the years 1966-1984. They compared the number of
deaths during the week before Passover to the number of deaths during the week
after Passover for 1919 people who had Jewish surnames. Of these deaths, 922 occurred
in the week before Passover and 997 in the week after Passover. The significance
of this discrepancy can be assessed by statistical calculations. We can think of
the counts before and after as constituting a table with two cells. If there is no
holiday effect, then a death has probability $\frac{1}{2}$ of falling in each cell. Thus, in
order to show that there is a holiday effect, it is necessary to show that this simple
model does not fit the data. Test the goodness of fit of the model by Pearson's
$X^2$ test or by a likelihood ratio test. Repeat this analysis for a group of males of
Chinese and Japanese ancestry, of whom 418 died in the week before Passover
and 434 died in the week after. What is the relevance of this latter analysis to the
former?


34. Test the goodness of fit of the data to the genetic model given in Problem 55 of
Chapter 8.


35. Test the goodness of fit of the data to the genetic model given in Problem 58 of
Chapter 8.


36. The National Center for Health Statistics (1970) gives the following data on
distribution of suicides in the United States by month in 1970. Is there any
evidence that the suicide rate varies seasonally, or are the data consistent with
the hypothesis that the rate is constant? (Hint: Under the latter hypothesis, model
the number of suicides in each month as a multinomial random variable with the
appropriate probabilities and conduct a goodness-of-fit test. Look at the signs of
the deviations, $O_i - E_i$, and see if there is a pattern.)


Month    Number of Suicides    Days/Month
Jan.     1867                  31
Feb.     1789                  28
Mar.     1944                  31
Apr.     2094                  30
May      2097                  31
June     1981                  30
July     1887                  31
Aug.     2024                  31
Sept.    1928                  30
Oct.     2032                  31
Nov.     1978                  30
Dec.     1859                  31


37. The following table gives the number of deaths due to accidental falls for each
month during 1970. Is there any evidence for a departure from uniformity in the
rate over time? That is, is there a seasonal pattern to this death rate? If so, describe
its pattern and speculate as to causes.


Month    Number of Deaths
Jan.     1668
Feb.     1407
Mar.     1370
Apr.     1309
May      1341
June     1338
July     1406
Aug.     1446
Sept.    1332
Oct.     1363
Nov.     1410
Dec.     1526





38. Yip et al. (2000) studied seasonal variations in suicide rates in England and Wales
during 1982-1996, collecting counts shown in the following table:


Month     Jan    Feb    Mar    Apr    May    June   July   Aug    Sept   Oct    Nov    Dec
Male      3755   3251   3777   3706   3717   3660   3669   3626   3481   3590   3605   3392
Female    1362   1244   1496   1452   1448   1376   1370   1301   1337   1351   1416   1226


Do either the male or female data show seasonality?


39. There is a great deal of folklore about the effects of the full moon on humans and 
other animals. Do animals bite humans more during a full moon? In an attempt 
to study this question, Bhattacharjee et al. (2000) collected data on admissions to 
a medical facility for treatment of bites by animals: cats, rats, horses, and dogs. 
95% of the bites were by man’s best friend, the dog. The lunar cycle was divided 
into 10 periods, and the number of bites in each period is shown in the following 
table. Day 29 is the full moon. Is there a temporal trend in the incidence of bites? 





Lunar Day          16,17,18   19,20,21   22,23,24   25,26,27   28,29,1   2,3,4   5,6,7   8,9,10   11,12,13   14,15
Number of Bites    137        150        163        201        269       155     142     146      148        110


40. Consider testing goodness of fit for a multinomial distribution with two cells.
Denote the number of observations in each cell by $X_1$ and $X_2$ and let the hypothesized
probabilities be $p_1$ and $p_2$. Pearson's chi-square statistic is equal to

$$\sum_{i=1}^{2} \frac{(X_i - np_i)^2}{np_i}$$

Show that this may be expressed as

$$\frac{(X_1 - np_1)^2}{np_1(1 - p_1)}$$

Because $X_1$ is binomially distributed, the following holds approximately under
the null hypothesis:

$$\frac{X_1 - np_1}{\sqrt{np_1(1 - p_1)}} \sim N(0, 1)$$

Thus, the square of the quantity on the left-hand side is approximately distributed
as a chi-square random variable with 1 degree of freedom.


41. Let $X_i \sim \text{bin}(n_i, p_i)$, for $i = 1, \ldots, m$, be independent. Derive a likelihood ratio
test for the hypothesis

$$H_0\colon p_1 = p_2 = \cdots = p_m$$

against the alternative hypothesis that the $p_i$ are not all equal. What is the large-sample
distribution of the test statistic?


42. Nylon bars were tested for brittleness (Bennett and Franklin 1954). Each of 280
bars was molded under similar conditions and was tested in five places. Assuming
that each bar has uniform composition, the number of breaks on a given bar should
be binomially distributed with five trials and an unknown probability $p$ of failure.
If the bars are all of the same uniform strength, $p$ should be the same for all of
them; if they are of different strengths, $p$ should vary from bar to bar. Thus, the
null hypothesis is that the $p$'s are all equal. The following table summarizes the
outcome of the experiment:


Breaks/Bar    Frequency
0             157
1             69
2             35
3             17
4             1
5             1





a. Under the given assumption, the data in the table consist of 280 observations 
of independent binomial random variables. Find the mle of p. 

b. Pooling the last three cells, test the agreement of the observed frequency 
distribution with the binomial distribution using Pearson’s chi-square test. 

c. Apply the test procedure derived in the previous problem. 


43. a. In 1965, a newspaper carried a story about a high school student who reported
getting 9207 heads and 8743 tails in 17,950 coin tosses. Is this a significant
discrepancy from the null hypothesis $H_0\colon p = \frac{1}{2}$?

b. Jack Youden, a statistician at the National Bureau of Standards, contacted the
student and asked him exactly how he had performed the experiment (Youden
1974). To save time, the student had tossed groups of five coins at a time, and
a younger brother had recorded the results, shown in the following table:


Number of Heads    Frequency
0                  100
1                  524
2                  1080
3                  1126
4                  655
5                  105


Are the data consistent with the hypothesis that all the coins were fair
($p = \frac{1}{2}$)?

c. Are the data consistent with the hypothesis that all five coins had the same
probability of heads but that this probability was not necessarily $\frac{1}{2}$? (Hint: Use
the binomial distribution.)


44. Derive and carry out a likelihood ratio test of the hypothesis $H_0\colon \theta = \frac{1}{2}$ versus
$H_1\colon \theta \neq \frac{1}{2}$ for Problem 58 of Chapter 8.


45. In a classic genetics study, Geissler (1889) studied hospital records in Saxony 
and compiled data on the gender ratio. The following table shows the number 
of male children in 6115 families with 12 children. If the genders of successive 
children are independent and the probabilities remain constant over time, the 
number of males born to a particular family of 12 children should be a binomial 
random variable with 12 trials and an unknown probability p of success. If the 
probability of a male child is the same for each family, the table represents the 
occurrence of 6115 binomial random variables. Test whether the data agree with 
this model. Why might the model fail? 











Number    Frequency
0         7
1         45
2         181
3         478
4         829
5         1112
6         1343
7         1033
8         670
9         286
10        104
11        24
12        3


46. Show that the transformation $Y = \sin^{-1}\sqrt{\hat{p}}$ is variance-stabilizing if $\hat{p} = X/n$,
where $X \sim \text{bin}(n, p)$.


47. Let $X$ follow a Poisson distribution with mean $\lambda$. Show that the transformation
$Y = \sqrt{X}$ is variance-stabilizing.


48. Suppose that $E(X) = \mu$ and $\text{Var}(X) = c\mu^2$, where $c$ is a constant. Find a
variance-stabilizing transformation.


49. An English naturalist collected data on the lengths of cuckoo eggs, measuring to
the nearest .5 mm. Examine the normality of this distribution by (a) constructing
a histogram and superposing a normal density, (b) plotting on normal probability
paper, and (c) constructing a hanging rootogram.


Length    Frequency
18.5      0
19.0      1
19.5      3
20.0      33
20.5      39
21.0      156
21.5      152
22.0      392
22.5      288
23.0      286
23.5      100
24.0      86
24.5      21
25.0      12
25.5      2
26.0      0
26.5      1


50. Burr (1974) gives the following data on the percentage of manganese in iron
made in a blast furnace. For 24 days, a single analysis was made on each of five
casts. Examine the normality of this distribution by making a normal probability
plot and a hanging rootogram. (As a prelude to topics that will be taken up in
later chapters, you might also informally examine whether the percentage of
manganese is roughly constant from one day to the next or whether there are
significant trends over time.)





Day     1      2      3      4      5      6      7      8      9      10     11     12

        1.40   1.40   1.80   1.54   1.52   1.62   1.58   1.62   1.60   1.38   1.34   1.50
        1.28   1.34   1.44   1.50   1.46   1.58   1.64   1.46   1.44   1.34   1.28   1.46
        1.56   1.54   1.46   1.48   1.42   1.62   1.62   1.38   1.46   1.36   1.08   1.28
        1.38   1.44   1.50   1.52   1.58   1.76   1.72   1.42   1.38   1.58   1.08   1.18
        1.44   1.46   1.38   1.58   1.70   1.68   1.60   1.38   1.34   1.38   1.36   1.28


Day     13     14     15     16     17     18     19     20     21     22     23     24

        1.26   1.52   1.50   1.42   1.32   1.16   1.24   1.30   1.30   1.48   1.32   1.44
        1.50   1.50   1.42   1.32   1.40   1.34   1.22   1.48   1.52   1.46   1.22   1.28
        1.52   1.46   1.38   1.48   1.40   1.40   1.20   1.28   1.76   1.48   1.72   1.10
        1.38   1.34   1.36   1.36   1.26   1.16   1.30   1.18   1.16   1.42   1.18   1.06
        1.50   1.40   1.38   1.38   1.26   1.54   1.36   1.28   1.28   1.36   1.36   1.10





51. Examine the probability plot in Figure 9.6 and explain why there are several sets 
of horizontal bands of points. 


52. The following table gives values of two abundance ratios for different isotopes 
of potassium from several samples of minerals (H. Ku, private communication). 
Examine whether each of the ratios appears normally distributed by first making 
histograms and superposing normal densities and then making probability plots. 








39K/41K  41K/40K  39K/41K  41K/40K  39K/41K  41K/40K
13.8645 576.369 13.8689 578.277 13.8724 576.017 
13.8695 578.012 13.8593 574.708 13.8665 574.881 
13.8659 575.597 13.8742 573.630 13.8566 578.508
13.8622 575.244 13.8703 576.069 13.8555 576.796
13.8696 575.567 13.8472 575.637 13.8534 580.394
13.8604 576.836 13.8555 575.971 13.8685 576.772
13.8672 576.236 13.8439 576.403 13.8694 576.501
13.8598 575.291 13.8646 576.179 13.8599 574.950
13.8641 576.478 13.8702 575.129 13.8605 577.614
13.8673 576.992 13.8606 577.084 13.8619 574.506
13.8597 578.335 13.8622 576.749 13.9641 576.317
13.8604 576.767 13.8588 576.669 13.8597 575.665 
13.8591 576.571 13.8547 575.869 13.8617 575.815 
13.8472 576.617 13.8597 577.793 13.861 576.109 
13.863 575.885 13.8668 577.770 13.8615 576.144 
13.8566 576.651 13.8597 577.697 13.8469 576.820 
13.8503 575.974 13.8604 576.299 13.8582 576.672 
13.8553 577.255 13.8634 575.903 13.8645 576.169 
13.8642 574.664 13.8658 574.773 13.8713 575.390 
13.8613 576.405 13.8547 577.391 13.8593 575.108 
13.8706 574.306 13.8519 577.057 13.8522 576.663 
13.8601 577.095 13.863 577.286 13.8489 578.358 
13.866 576.957 13.8581 575.510 13.8609 575.371 
13.8655 576.434 13.8644 576.509 13.857 575.851 
13.8612 575.211 13.8665 574.300 13.8566 575.644 
13.8598 576.630 13.8648 575.846 13.864 574.462 











53. Hoaglin (1980) suggested a "Poissonness plot," a simple visual method for
assessing goodness of fit. The expected frequencies for a sample of size $n$ from
a Poisson distribution are

$$E_k = nP(X = k) = n e^{-\lambda} \frac{\lambda^k}{k!}$$

or

$$\log E_k = \log n - \lambda + k \log \lambda - \log k!$$

Thus, a plot of $\log(O_k) + \log k!$ versus $k$ should yield nearly a straight line with
a slope of $\log \lambda$ and an intercept of $\log n - \lambda$. Construct such plots for the data
of Problems 1, 2, and 3 of Chapter 8. Comment on how straight they are.
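
The plotting coordinates described in this problem can be sketched in Python as follows. This is only an illustration: the function name `poissonness_points`, the sample size, and the value of lambda are hypothetical, and the check uses exact Poisson expected frequencies rather than any data set from the text. Cells with zero observed count are skipped because their logarithm is undefined.

```python
import math

def poissonness_points(observed):
    """Return (k, log O_k + log k!) for each count k with O_k > 0.

    observed[k] is the number of observations equal to k, k = 0, 1, 2, ...
    """
    return [(k, math.log(o_k) + math.lgamma(k + 1))  # lgamma(k + 1) = log k!
            for k, o_k in enumerate(observed) if o_k > 0]

# Sanity check on exact Poisson expected frequencies (hypothetical n and lambda):
# the points should fall on a line with slope log(lambda) and intercept log(n) - lambda.
n, lam = 1000, 2.0
expected = [n * math.exp(-lam) * lam**k / math.factorial(k) for k in range(7)]
points = poissonness_points(expected)
slopes = [points[k + 1][1] - points[k][1] for k in range(len(points) - 1)]
```

With real counts in place of `expected`, the points can be passed to any plotting routine and judged by eye for straightness.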


54. A random variable $X$ is said to follow a lognormal distribution if $Y = \log(X)$
follows a normal distribution. The lognormal is sometimes used as a model for
heavy-tailed skewed distributions.

a. Calculate the density function of the lognormal distribution.

b. Examine whether the lognormal roughly fits the following data (Robson 1929),
which are the dorsal lengths in millimeters of taxonomically distinct octopods.


55. a. Generate samples of size 25, 50, and 100 from a normal distribution. Construct
probability plots. Do this several times to get an idea of how probability plots
behave when the underlying distribution is really normal.

b. Repeat part (a) for a chi-square distribution with 10 df.

c. Repeat part (a) for $Y = Z/U$, where $Z \sim N(0, 1)$ and $U \sim U[0, 1]$ and $Z$
and $U$ are independent.

d. Repeat part (a) for a uniform distribution.

e. Repeat part (a) for an exponential distribution.

f. Can you distinguish between the normal distribution of part (a) and the
subsequent nonnormal distributions?


56. Suppose that a sample is taken from a symmetric distribution whose tails decrease
more slowly than those of the normal distribution. What would be the qualitative
shape of a normal probability plot of this sample?


57. The Cauchy distribution has the probability density function

$$f(x) = \frac{1}{\pi} \left( \frac{1}{1 + x^2} \right), \qquad -\infty < x < \infty$$

What would be the qualitative shape of a normal probability plot of a sample
from this distribution?


58. Show how probability plots for the exponential distribution, $F(x) = 1 - e^{-\lambda x}$,
may be constructed. Berkson (1966) recorded times between events and fit them 
to an exponential distribution. (The times between events in a Poisson process 
are exponentially distributed.) The following table comes from Berkson's paper. 


Make an exponential probability plot, and evaluate its "straightness." 


Time Interval (sec) Observed Frequency 





0-60 115 
60-120 104 
120-181 99 
181-243 106 
243-306 113 
306-369 104
369-432 101
432-497 106
497-562 104
562-628 96
628-689 512
689-1130 524 
1130-1714 468 
1714-2125 531 
2125-2567 461 
2567-3044 526 
3044-3562 506 
3562-4130 509 
4130-4758 520 
4758-5460 540 
5460-6255 542 
6255-7174 499 
7174-8260 494 
8260-9590 500 
9590-11,304 550 
11,304-13,719 465 
13,719-14,347 104 
14,347-15,049 97 
15,049-15,845 101 
15,845-16,763 104 
16,763-17,849 92 
17,849-19,179 102 
19,179-20,893 103
20,893-23,309 110 
23,309-27,439 112 
27,439+ 100 





59. Construct a hanging rootogram from the data of the previous problem in order to
compare the observed distribution to an exponential distribution.


60. The exponential distribution is widely used in studies of reliability as a model
for lifetimes, largely because of its mathematical simplicity. Barlow, Toland, and
Freeman (1984) analyzed data on the strength of Kevlar 49/epoxy, a material
used in the space shuttle. The times to failure (in hours) of 76 strands tested at a
stress level of 90% are given in the following table.





Times to Failure at 90% Stress Level

.01    .01    .02    .02    .02
.03    .03    .04    .05    .06
.07    .07    .08    .09    .09
.10    .10    .11    .11    .12
.13    .18    .19    .20    .23
.24    .24    .29    .34    .35
.36    .38    .40    .42    .43
.52    .54    .56    .60    .60
.63    .65    .67    .68    .72
.72    .72    .73    .79    .79
.80    .80    .83    .85    .90
.92    .95    .99    1.00   1.01
1.02   1.00   1.05   1.10   1.10
1.11   1.15   1.18   1.20   1.29
1.31   1.33   1.34   1.40   1.43
1.45   1.50   1.51   1.52   1.53
1.54   1.54   1.55   1.58   1.60
1.63   1.64   1.80   1.80   1.81
2.00   2.05   2.14   2.17   2.33
3.00   3.00   3.24   4.20   4.69
7.89


a. Construct a probability plot of the data against the quantiles of an exponential 
distribution to assess qualitatively whether the exponential is a reasonable 
model. Can you explain the peculiar appearance of the plot? 

b. Compare the data to the exponential distribution by means of a hanging rooto- 
gram. 


61. The files haliburton and macdonalds give the monthly returns on the 
stocks of these two companies from 1975 through 1999. 


a. Make histograms of the returns and superimpose fitted normal densities. Com- 
ment on the quality of the fit. Which stock is more volatile? 
b. Make normal probability plots and again comment on the quality of the fit. 


62. Apply the Poisson dispersion test to the data on gamma-ray counts—Problem 42 
of Chapter 8. You will have to modify the development of the likelihood ratio 
test in Section 9.5 to take account of the time intervals being of different lengths. 


63. Construct a gamma probability plot for the data of Problem 46 of Chapter 8.


64. The file bodytemp contains normal body temperature readings (degrees Fahrenheit)
and heart rates (beats per minute) of 65 males (coded by 1) and 65 females
(coded by 2) from Shoemaker (1996).

a. Assess the normality of the male and female body temperatures by making normal
probability plots. In order to judge the inherent variability of these plots,
simulate several samples from normal distributions with matching means and
standard deviations, and make normal probability plots. What do you conclude?

b. Repeat the preceding problem for heart rates.

c. For the males, test the null hypothesis that the mean body temperature is 98.6°
versus the alternative that the mean is not equal to 98.6°. Do the same for the
females. What do you conclude?


65. This problem continues the analysis of the chromatin data from Problem 45 of
Chapter 8 and is concerned with further examining goodness of fit.

a. Goodness of fit can also be examined via probability plots in which the quantiles
of a theoretical distribution are plotted against those of the empirical
distribution. Following the discussion in Section 9.8, show that it is sufficient
to plot the observed order statistics, $X_{(k)}$, versus the quantiles of the Rayleigh
distribution with $\theta = 1$. Construct three such probability plots and comment
on any systematic lack of fit that you observe. To get an idea of what sort of
variability could be expected due to chance, simulate several sets of data from
a Rayleigh distribution and make corresponding probability plots.

b. Formally test goodness of fit by performing a chi-squared goodness of fit test,
comparing histogram counts to those predicted from the Rayleigh model. You
may need to combine cells of the histograms so that the expected counts in
each cell are at least 5.


CHAPTER 10


Summarizing Data


10.1 Introduction


This chapter deals with methods of describing and summarizing data that are in the
form of one or more samples, or batches. These procedures, many of which generate
graphical displays, are useful in revealing the structure of data that are initially in
the form of numbers printed in columns on a page or recorded on a tape or disk
as a computer file. In the absence of a stochastic model, the methods are useful for
purely descriptive purposes. If it is appropriate to entertain a stochastic model, the
implications of that model for the method are of interest. For example, the arithmetic
mean $\bar{x}$ is often used as a summary of a collection of numbers $x_1, x_2, \ldots, x_n$; it
indicates a "typical value." (We discuss some of its strengths and weaknesses in this
regard in Section 10.4.) In some situations, it may be useful to model the collection
of numbers as a realization of $n$ independent random variables $X_1, X_2, \ldots, X_n$ with
common mean $\mu$ and variance $\sigma^2$. The question of variability of $\bar{X}$ can be addressed
with such a model: the mean $\bar{x}$ is regarded as an estimate of $\mu$, and we know from
previous work that the stochastic model implies $E(\bar{X}) = \mu$ and $\text{Var}(\bar{X}) = \sigma^2/n$.
We will first discuss methods that are data analogues of the cumulative distri- 
bution function of a random variable. These methods are useful in displaying the 
distribution of data values. Next, we will discuss the histogram and related graphical 
displays that play the role for data that the probability density or frequency function 
plays for a random variable, giving a different view of the distribution of data values 
than that provided by the cumulative distribution function. We then discuss simpler 
numerical summaries of data, numbers that indicate a typical or central value of the 
data and a quantification of the spread. Such statistics provide a more condensed 
summary than do the cumulative distribution function and the histogram. We will 
pay particular attention to the effect of extreme data points on these measures. Next, 
we will introduce boxplots, graphical summaries that combine in a simple form 
information about the central values, spread, and shape of a distribution. Finally,
scatterplots are introduced as a method for displaying information about relationships
of variables.


10.2 Methods Based on the Cumulative Distribution Function


10.2.1 The Empirical Cumulative Distribution Function


Suppose that $x_1, \ldots, x_n$ is a batch of numbers. (The word sample is often used in the
case that the $x_i$ are independently and identically distributed with some distribution
function; the word batch will imply no such commitment to a stochastic model.) The
empirical cumulative distribution function (ecdf) is defined as

$$F_n(x) = \frac{1}{n} (\#x_i \le x)$$

(With this definition, $F_n$ is right-continuous; in the former Soviet Union and Eastern
Europe, the ecdf is usually defined to be left-continuous.)

Denote the ordered batch of numbers by $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Then if $x < x_{(1)}$,
$F_n(x) = 0$; if $x_{(1)} \le x < x_{(2)}$, $F_n(x) = 1/n$; if $x_{(k)} \le x < x_{(k+1)}$, $F_n(x) = k/n$; and
so on. If there is a single observation with value $x$, $F_n$ has a jump of height $1/n$ at $x$;
if there are $r$ observations with the same value $x$, $F_n$ has a jump of height $r/n$ at $x$.

The ecdf is the data analogue of the cumulative distribution function of a random
variable: $F(x)$ gives the probability that $X \le x$ and $F_n(x)$ gives the proportion of the
collection of numbers less than or equal to $x$.
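
A minimal Python sketch of this definition follows, assuming NumPy is available. The helper name `ecdf` and the toy batch are illustrative and do not come from the text. Counting with `side="right"` includes ties, which gives the right-continuous version of the ecdf.

```python
import numpy as np

def ecdf(batch):
    """Return F_n, where F_n(x) is the proportion of the batch that is <= x."""
    xs = np.sort(np.asarray(batch, dtype=float))
    n = xs.size

    def F_n(x):
        # searchsorted(..., side="right") counts the observations <= x,
        # so F_n is right-continuous and jumps by r/n at a value tied r times.
        return np.searchsorted(xs, x, side="right") / n

    return F_n

# Toy batch (hypothetical): the value 2.0 appears twice, so the jump there is 2/4.
F = ecdf([3.0, 1.0, 2.0, 2.0])
```

Replacing `side="right"` with `side="left"` would give the left-continuous convention mentioned above.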


EXAMPLE A

As an example of the use of the ecdf, let us consider data taken from a study by White,
Riethof, and Kushnir (1960) of the chemical properties of beeswax. The aim of the
study was to investigate chemical methods for detecting the presence of synthetic
waxes that had been added to beeswax. For example, the addition of microcrystalline
wax raises the melting point of beeswax. If all pure beeswax had the same melting
point, its determination would be a reasonable way to detect dilutions. The melting
point and other chemical properties of beeswax, however, vary from one beehive to
another. The authors obtained samples of pure beeswax from 59 sources, measured
several chemical properties, and examined the variability of the measurements. The
59 melting points (in °C) are listed here. As a summary of these measurements, the
ecdf is plotted in Figure 10.1.


63.78   63.45   63.58   63.08   63.40   64.42   63.27   63.10
63.34   63.50   63.83   63.63   63.27   63.30   63.83   63.50
63.36   63.86   63.34   63.92   63.88   63.36   63.36   63.51
63.51   63.84   64.27   63.50   63.56   63.39   63.78   63.92
63.02   63.56   63.43   64.21   64.24   64.12   63.92   63.53
63.50   63.30   63.86   63.93   63.43   64.40   63.61   63.03
63.68   63.13   63.41   63.60   63.13   63.60   63.05   62.85
63.31   63.66   63.60

FIGURE 10.1 The empirical cumulative distribution function of the melting points
of beeswax.


Figure 10.1 conveniently summarizes the natural variability in melting points. 
For example, we can see from the graph that about 90% of the samples had melting 
points less than 64.2°C and that about 12% had melting points less than 63.2°C. 

White, Riethof, and Kushnir showed that the addition of 5% microcrystalline 
wax raised the melting point of beeswax by .85°C and the addition of 10% raised 
the melting point by 2.22°C. From Figure 10.1, we can see that an addition of 5% 
microcrystalline wax might well be difficult to detect, especially if it was made to 
beeswax that had a low melting point, but that an addition of 10% would be detectable. 
In further calculations, the investigators modeled the distribution of melting points as 
Gaussian. How reasonable does this model appear to be? ■


Let us briefly consider some of the elementary statistical properties of the ecdf
in the case in which $X_1, \ldots, X_n$ is a random sample from a continuous distribution
function, $F$. For purposes of analysis, it is convenient to express $F_n$ in the following
way:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i)$$

where

$$I_{(-\infty, x]}(X_i) = \begin{cases} 1, & \text{if } X_i \le x \\ 0, & \text{if } X_i > x \end{cases}$$

The random variables $I_{(-\infty, x]}(X_i)$ are independent Bernoulli random variables:

$$I_{(-\infty, x]}(X_i) = \begin{cases} 1, & \text{with probability } F(x) \\ 0, & \text{with probability } 1 - F(x) \end{cases}$$

Thus, $nF_n(x)$ is a binomial random variable ($n$ trials, probability $F(x)$ of success)
and so

$$E[F_n(x)] = F(x)$$

$$\text{Var}[F_n(x)] = \frac{1}{n} F(x)[1 - F(x)]$$

As an estimate of $F(x)$, $F_n(x)$ is unbiased and has a maximum variance at that value
of $x$ such that $F(x) = .5$, that is, at the median. As $x$ becomes very large or very
small, the variance tends to zero.
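
The mean and variance formulas can be checked by simulation. The sketch below is only an illustration under stated assumptions: it takes $F$ to be the uniform distribution on $[0, 1]$, so that $F(x) = x$, and the seed, sample size, number of replications, and the helper name `ecdf_variance` are all arbitrary choices that do not come from the text.

```python
import numpy as np

def ecdf_variance(F_x, n):
    """Theoretical variance of F_n(x) when the true cdf value is F(x) = F_x."""
    return F_x * (1.0 - F_x) / n

rng = np.random.default_rng(0)            # fixed seed, arbitrary choice
n, reps, x = 50, 20000, 0.3               # hypothetical settings
samples = rng.uniform(size=(reps, n))     # each row is one sample of size n from U[0, 1]
Fn_at_x = (samples <= x).mean(axis=1)     # F_n(x) computed for every replication

simulated_mean = Fn_at_x.mean()           # should be close to F(x) = 0.3
simulated_var = Fn_at_x.var()             # should be close to 0.3 * 0.7 / 50
theoretical_var = ecdf_variance(x, n)
```

Evaluating `ecdf_variance` over a grid of values between 0 and 1 shows the maximum at .5 and the decay toward zero in both tails.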

In the preceding paragraph, we considered $F_n(x)$ for fixed $x$; the results can be
applied to form a confidence interval for $F(x)$ for any given value of $x$. Much deeper
analysis focuses on the stochastic behavior of $F_n$ as a random function; that is, all
values of $x$ are considered simultaneously. It turns out, somewhat surprisingly, that
the distribution of

$$\max_{-\infty < x < \infty} |F_n(x) - F(x)|$$

does not depend on $F$ if $F$ is continuous. This result makes possible the construction
of a simultaneous confidence band about $F_n$, which can be used to test goodness
of fit. [For further details, refer to Section 9.6 of Bickel and Doksum (1977).] It is
important to realize the difference between the simultaneous confidence band and the
individual confidence intervals that may be constructed using the binomial distribution.
Each such individual confidence interval covers $F$ at one point with a certain
probability, say, $1 - \alpha$, but the probability that all such intervals cover $F$ simultaneously
is not necessarily $1 - \alpha$. We will encounter other phenomena of this type in later
chapters.
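
As a rough illustration of an individual (pointwise) interval, the sketch below uses the normal approximation to the binomial distribution. That approximation is an assumption made here for brevity, not a method prescribed by the text, and an exact binomial interval could be substituted. The function name `pointwise_interval` and the counts in the usage line are hypothetical.

```python
import math
from statistics import NormalDist

def pointwise_interval(count_le_x, n, level=0.95):
    """Approximate confidence interval for F(x) at one fixed x.

    count_le_x is the number of observations <= x, so F_n(x) = count_le_x / n.
    The interval is F_n(x) +/- z * sqrt(F_n(x) * (1 - F_n(x)) / n), clipped to [0, 1].
    """
    p_hat = count_le_x / n
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 when level = 0.95
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical usage: 30 of 59 observations are <= x.
lo, hi = pointwise_interval(30, 59)
```

An interval of this kind covers $F$ at a single point only; a set of such intervals computed at many values of $x$ does not form a simultaneous band.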


10.2.2 The Survival Function


The survival function is equivalent to the cumulative distribution function and is
defined as

$$S(t) = P(T > t) = 1 - F(t)$$

where $T$ is a random variable with cdf $F$. In applications where the data consist of
times until failure or death and are thus nonnegative, it is often customary to work
with the survival function rather than the cumulative distribution function, although
the two give equivalent information. Data of this type occur in medical and reliability
studies. In these cases, $S(t)$ is simply the probability that the lifetime will be longer
than $t$. We will be concerned with the sample analogue of $S$,

$$S_n(t) = 1 - F_n(t)$$

which gives the proportion of the data greater than $t$.
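
A minimal Python sketch of the empirical survival function follows, assuming NumPy is available. The function name, the optional `n_total` argument, and the toy lifetimes are hypothetical. `n_total` is included for the case in which some subjects are still alive when observation stops: their lifetimes are unknown, but they still count in the denominator, so the result is meaningful only for $t$ within the observation period.

```python
import numpy as np

def empirical_survival(death_times, n_total=None):
    """Return S_n, where S_n(t) is the proportion of subjects with lifetime > t.

    death_times: the observed lifetimes.
    n_total: total number of subjects if some were still alive at the end of
             the study (hypothetical convenience argument); defaults to
             len(death_times).
    """
    ts = np.sort(np.asarray(death_times, dtype=float))
    n = ts.size if n_total is None else n_total

    def S_n(t):
        # 1 - F_n(t): subtract the proportion of subjects with lifetime <= t.
        return 1.0 - np.searchsorted(ts, t, side="right") / n

    return S_n

# Toy lifetimes in days (hypothetical).
S = empirical_survival([10, 20, 20, 40])
# Two deaths observed among four subjects; the other two outlived the study.
S_partial = empirical_survival([10, 20], n_total=4)
```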


EXAMPLE A


As an example, let us consider the use of the survival function with a study of the
lifetimes of guinea pigs infected with varying doses of tubercle bacilli (Bjerkdal
1960). In one study, five groups of 72 animals each were inoculated with the bacilli
at increasing dosages, and a control group of 107 animals was used. We denote the
inoculated groups by I, II, III, IV, and V, in order of increasing dose. The animals
were observed over a 2-year period, and their times of death (in days) were recorded.
The data are given here. Note that not all the animals in the lower-dosage regimens
died.

Control Lifetimes
18 36 50 52 86 87 89 91 
102 105 114 114 115 118 119 120 
149 160 165 166 167 167 173 178 
189 209 212 216 273 278 279 292 
341 355 367 380 382 421 421 432 
446 455 463 474 506 515 546 559 
576 590 603 607 608 621 634 634 
637 638 641 650 663 665 688 725 
735 
Dose I Lifetimes 
76 93 97 107 108 113 114 119 
136 137 138 139 152 154 154 160 
164 164 166 168 178 179 181 181 
183 185 194 198 212 213 216 220 
225 225 244 253 256 259 265 268 
268 270 283 289 291 311 315 326 
326 361 373 373 376 397 398 406 
452 466 592 598 
Dose II Lifetimes 
72 72 78 83 85 99 99 110 
113 113 114 114 118 119 123 124 
131 133 135 137 140 142 144 145 
154 156 157 162 162 164 165 167 
171 176 177 181 182 187 192 196 
211 214 216 216 218 228 238 242 
248 256 257 262 264 267 267 270 
286 303 309 324 326 334 335 358 
409 473 550 
Dose III Lifetimes 
10 33 44 56 59 72 74 77 
92 93 96 100 100 102 105 107 
107 108 108 108 109 112 113 115 
116 120 121 122 122 124 130 134 
136 139 144 146 153 159 160 163 
163 168 171 172 176 183 195 196 
197 202 213 215 216 222 230 231 
240 245 251 253 254 254 278 293 
327 342 347 361 402 432 458 555 
Dose IV Lifetimes 
43 45 53 56 56 57 58 66 
67 73 74 79 80 80 81 81 
81 82 83 83 84 88 89 91 
91 92 92 97 99 99 100 100 
101 102 102 102 103 104 107 108 
109 113 114 118 121 123 126 128 
137 138 139 144 145 147 156 162 
174 178 179 184 191 198 211 214 
243 249 329 380 403 511 522 598 





Dose V Lifetimes 


12 15 22 24 24 32 32 33 
34 38 38 43 44 48 52 53 
54 54 55 56 57 58 58 59 
60 60 60 60 61 62 63 65 
65 67 68 70 70 72 73 75 
76 76 81 83 84 85 87 91 
95 96 98 99 109 110 121 127 
129 131 143 146 146 175 175 211 
233 258 258 263 297 341 341 376 


A plot (Figure 10.2) of the empirical survival functions provides a convenient 
summary of the data. The proportions surviving beyond given times are plotted; it is 
not necessary to know the actual lifetimes of the animals that survived beyond the 
termination of the study. The graph is a much more effective presentation of the data 
than the tabular listings. 

One of Bjerkdahl’s primary interests was comparing the effect increased exposure 
had on guinea pigs that had different levels of resistivity. Comparing groups III and 
V, for example, we see that the difference in lifetimes of the weakest guinea pigs (say 
the 10% weakest) from the two groups was about 50 days, whereas the difference in 
lifetimes for stronger animals increases to about 100 days. 
FIGURE 10.2 Survival functions for guinea pig lifetimes. For purposes of visual clarity, 
the points have been joined by lines: The solid line corresponds to the control group, 
the dotted line to group I, the short-dash line to group II, the long-dash line to group III, 
the dot-and-long-dash line to group IV, and the short-and-long-dash line to group V. 


Survival plots may also be used for informal examinations of the hazard func- 
tion, which may be interpreted as the instantaneous death rate for individuals who 
have survived up to a given time. If an individual is alive at time t, the probability 
that that individual will die in the time interval (t, t + δ) is, assuming that the density 
function f is continuous at t, 

$$P(t \le T \le t + \delta \mid T \ge t) = \frac{P(t \le T \le t + \delta)}{P(T \ge t)} = \frac{F(t + \delta) - F(t)}{1 - F(t)} \approx \frac{\delta f(t)}{1 - F(t)}$$





The hazard function is defined as 

$$h(t) = \frac{f(t)}{1 - F(t)}$$

and may be thought of as the instantaneous rate of mortality for an individual alive 
at time t. If T is the lifetime of a manufactured component, it may be natural to think 
of h(t) as the instantaneous or age-specific failure rate. It may also be expressed as 


$$h(t) = -\frac{d}{dt}\log[1 - F(t)] = -\frac{d}{dt}\log S(t)$$


which reveals that it is the negative of the slope of the log of the survival function. 
Consider, for example, the exponential distribution: 


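$$\begin{aligned}
F(t) &= 1 - e^{-\lambda t} \\
S(t) &= e^{-\lambda t} \\
f(t) &= \lambda e^{-\lambda t} \\
h(t) &= \lambda
\end{aligned}$$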


The instantaneous mortality rate is constant. If the exponential distribution were used 
as a model for the time until failure of a component, it would imply that the probability 
of the component failing did not depend on its age. This is a consequence of the 
"memoryless" property of the exponential distribution (Section 2.2.1). An alternative 
model might have a hazard function that is U-shaped, the rate of failure being high 
for very new components because of flaws in the manufacturing process that show 
up very quickly, declining for components of intermediate age, and then increasing 
for older components as they wear out. 
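As a numerical check on the relation h(t) = −d/dt log S(t), the following sketch differentiates log S(t) by a central difference. The rate λ = 0.5 and the second survival function, S(t) = exp(−t²), are illustrative assumptions; the latter stands in for a component whose failure rate increases with age.

```python
import math


def hazard_from_survival(S, t, dt=1e-5):
    """Approximate h(t) = -d/dt log S(t) with a central difference."""
    return -(math.log(S(t + dt)) - math.log(S(t - dt))) / (2 * dt)


lam = 0.5  # illustrative rate, not taken from the text


def S_exponential(t):
    return math.exp(-lam * t)


def S_wearout(t):
    # Hypothetical wear-out model with S(t) = exp(-t^2), so that h(t) = 2t.
    return math.exp(-t ** 2)


for t in (0.5, 1.0, 2.0):
    print(t, hazard_from_survival(S_exponential, t), hazard_from_survival(S_wearout, t))
```

The first column of hazards stays at λ whatever the age t, while the second grows linearly with t.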

The empirical survival function and its logarithm can be expressed in terms of 
the ordered observations. For simplicity, suppose that there are no ties and that the 
ordered failure times are $T_{(1)} < T_{(2)} < \cdots < T_{(n)}$. Then if $t = T_{(i)}$, $F_n(t) = i/n$ and 
$S_n(t) = 1 - i/n$. Since $\log S_n(t)$ is then undefined for $t \ge T_{(n)}$, $S_n$ is often redefined as 
$S_n(t) = 1 - i/(n + 1)$ for $T_{(i)} \le t < T_{(i+1)}$. 


For the data of Example A, Figure 10.3 is a plot of the log of the empirical survival 
functions. We plotted log[1 − i/(n + 1)] versus the ordered survival times $T_{(i)}$. From 
the slopes of these curves, we see that the hazard functions are initially fairly small. 
As the dosage level increases, the instantaneous mortality rates both increase more 
quickly and reach higher levels. The increased mortality rate sets in at an earlier age 
for the high-dosage group and seems greater (the slope is greater). (To see this, hold 
the figure at an angle so that you are “looking down” the curves.) 
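The plotted points can be computed directly from the data. The sketch below does so for the Dose V lifetimes listed in Example A; the function names are assumptions of the sketch, and tied lifetimes are simply given consecutive ranks, which goes slightly beyond the no-ties simplification used in the text.

```python
import math

# Dose V lifetimes (days) from Example A; all 72 animals in this group died.
dose_v = [
    12, 15, 22, 24, 24, 32, 32, 33, 34, 38, 38, 43, 44, 48, 52, 53,
    54, 54, 55, 56, 57, 58, 58, 59, 60, 60, 60, 60, 61, 62, 63, 65,
    65, 67, 68, 70, 70, 72, 73, 75, 76, 76, 81, 83, 84, 85, 87, 91,
    95, 96, 98, 99, 109, 110, 121, 127, 129, 131, 143, 146, 146, 175, 175, 211,
    233, 258, 258, 263, 297, 341, 341, 376,
]


def log_survival_points(lifetimes):
    """Return the points (T_(i), log[1 - i/(n + 1)]) for i = 1, ..., n."""
    ordered = sorted(lifetimes)
    n = len(ordered)
    return [(t, math.log(1 - i / (n + 1))) for i, t in enumerate(ordered, start=1)]


def average_hazard(points, i, j):
    """Negative slope of the log survival curve between the ith and jth points."""
    (t0, y0), (t1, y1) = points[i - 1], points[j - 1]
    return -(y1 - y0) / (t1 - t0)


points = log_survival_points(dose_v)
print(average_hazard(points, 1, 8))    # early deaths, days 12 to 33
print(average_hazard(points, 17, 48))  # middle of the curve, days 54 to 91
```

The second average slope is several times the first, which is the numerical counterpart of the observation that the hazard is initially small and then increases.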


When interpreting plots such as that presented in Figure 10.3, we will find it 
useful to keep in mind the variability of the empirical log survival function. Using the 
method of propagation of error (Section 4.6), we have 


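$$\mathrm{Var}(\log[1 - F_n(t)]) \approx \frac{\mathrm{Var}[1 - F_n(t)]}{[1 - F(t)]^2} = \frac{1}{n}\,\frac{F(t)[1 - F(t)]}{[1 - F(t)]^2} = \frac{1}{n}\,\frac{F(t)}{1 - F(t)}$$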
FIGURE 10.3 Log survival functions for guinea pig lifetimes. For purposes of visual 
clarity, the points have been joined by lines: The solid line corresponds to the control 
group, the dotted line to group I, the short-dash line to group II, the long-dash line to 
group III, the dot-and-long-dash line to group IV, and the short-and-long-dash line 
to group V. 


From this expression, we see that for large values of t, the empirical log survival 
function is extremely unreliable, because 1 − F(t) is then very small. Thus, in practice, 
the last few data points are disregarded. (Note the large fluctuations of the log survival 
functions in Figure 10.3 for large times.) 
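To get a feeling for how quickly the reliability deteriorates, the approximate standard error implied by the variance expression can be tabulated. This is a small sketch; the function name is an assumption, and n = 72 is the size of each inoculated group in Example A.

```python
import math


def log_survival_se(F, n):
    """Approximate standard error of log[1 - F_n(t)] at a time t where F(t) = F."""
    return math.sqrt(F / (n * (1 - F)))


n = 72  # size of each inoculated group in Example A
for F in (0.5, 0.9, 0.99):
    print(F, round(log_survival_se(F, n), 3))
```

The standard error grows without bound as F(t) approaches 1, which is why the last few points of a log survival plot carry little information.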


Quantile-Quantile Plots 


Quantile-quantile (Q-Q) plots are useful for comparing distribution functions. If X 
is a continuous random variable with a strictly increasing distribution function, F, 
the pth quantile of the distribution was defined in Section 2.2 to be that value of x 
such that 


$$F(x_p) = p$$
or 
$$x_p = F^{-1}(p)$$


In a Q-Q plot, the quantiles of one distribution are plotted against those of another. 
Suppose, for purposes of discussion, that one cdf (F) is a model for observations of 
a control group and another (G) is a model for observations of a group that has 
received some treatment. Let the observations of the control group be denoted by x 
with cdf F, and let the observations of the treatment group be denoted by y with 
cdf G. The simplest effect that the treatment could have would be to increase the 
expected response of every member of the treatment group by the same amount, 
say h units. That is, both the weakest and the strongest individuals would have their 
responses changed by h. Then $y_p = x_p + h$, and the Q-Q plot would be a straight 
line with slope 1 and intercept h. We will now show that this relationship between 
the quantiles implies that the cumulative distribution functions have the relationship 
G(y) = F(y — h). This follows, because for every 0 < p < 1, 


$$p = G(y_p) = F(x_p) = F(y_p - h)$$

as in Figure 10.4. 

FIGURE 10.4 An additive treatment effect. The solid line is F(y), and the dotted 
line is G(y) = F(y − h). 


Another possible effect of a treatment would be multiplicative: The response 
(such as lifetime or strength) is multiplied by a constant, c. The quantiles would then 
be related as $y_p = c x_p$, and the Q-Q plot would be a straight line with slope c and 
intercept 0. The cdf's would be related as G(y) = F(y/c) (see Figure 10.5). 

A simple summary of a treatment effect for the additive model would be of the 
form “the treatment increases lifetime by 2 mo.” For the multiplicative model, one 
might say something like “the treatment increases lifetime by 25%.” 
FIGURE 10.5 A multiplicative treatment effect. The solid line is F(y), and the 
dotted line is G(y) = F(y/c). 


The effect of a treatment can, of course, be much more complicated than either of 
these two simple models. For example, a treatment could benefit weaker individuals 
and be to the detriment of stronger individuals. An educational program that places 
very heavy emphasis on elementary or basic skills might be expected to have this sort 
of effect relative to a regular program. 

Given a batch of numbers, or a sample from a probability distribution, quantiles 
are constructed from the order statistics. Given n observations and the order statistics 
$X_{(1)}, \ldots, X_{(n)}$, the k/(n + 1) quantile of data is assigned to $X_{(k)}$. (This convention 
is not unique; sometimes, for example, the quantile assigned to $X_{(k)}$ is defined as 
(k − .5)/n. For descriptive purposes, it makes little difference which definition we 
use.) In constructing probability plots in Chapter 9, we plotted sample quantiles 
defined as just described versus the quantiles of a theoretical distribution, such as the 
normal, and used these plots to informally assess goodness of fit. 

To compare two batches of n numbers with order statistics $X_{(1)}, \ldots, X_{(n)}$ and 
$Y_{(1)}, \ldots, Y_{(n)}$, a Q-Q plot is simply constructed by plotting the points $(X_{(i)}, Y_{(i)})$. If 
the batches are of unequal size, an interpolation process can be used. A procedure for 
interpolating intermediate quantiles is described in the end-of-chapter problems. 
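The construction can be sketched as follows. For equal batch sizes the code reduces to pairing the order statistics; for unequal sizes it interpolates linearly between order statistics, which is one plausible rule and only an assumption here (it is not necessarily the procedure given in the end-of-chapter problems). All names and data are hypothetical.

```python
import math


def sample_quantile(data, p):
    """Quantile by linear interpolation, with X_(k) placed at p = k/(n + 1).

    The interpolation rule is an assumption of this sketch.
    """
    xs = sorted(data)
    n = len(xs)
    pos = min(max(p * (n + 1), 1), n)  # fractional rank, clamped to [1, n]
    lo = math.floor(pos)
    if lo == n:
        return xs[-1]
    return xs[lo - 1] + (pos - lo) * (xs[lo] - xs[lo - 1])


def qq_points(x, y):
    """Pair the quantiles of x and y at the levels k/(m + 1), m = smaller batch size."""
    m = min(len(x), len(y))
    levels = [k / (m + 1) for k in range(1, m + 1)]
    return [(sample_quantile(x, p), sample_quantile(y, p)) for p in levels]


control = [3.0, 1.0, 4.0, 1.5, 9.0]           # hypothetical responses
additive = [v + 2.0 for v in control]         # y_p = x_p + h with h = 2
multiplicative = [1.25 * v for v in control]  # y_p = c * x_p with c = 1.25

print(qq_points(control, additive))        # points on a line with slope 1, intercept 2
print(qq_points(control, multiplicative))  # points on a line through 0 with slope 1.25
```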


Cleveland et al. (1974) used Q-Q plots in a study of air pollution. They plotted the 
quantiles of distributions of the values of various variables on Sunday against the 
quantiles for weekdays (Figure 10.6). The Q-Q plot of the ozone maxima shows that 
the very highest quantiles occur on weekdays but that all the other quantiles are larger 
on Sundays. For carbon monoxide, nitrogen oxide, and aerosols, the differences in the 
quantiles increase with increasing concentration. The very high and very low quantiles 
of solar radiation are about the same on Sundays and weekdays (presumably corresponding 
to very clear days and days with heavy cloud cover), but for intermediate 
quantiles, the Sunday quantiles are larger. 


FIGURE 10.6 Q-Q plots of air pollution variables: (a) ozone maxima (ppm), (b) carbon 
monoxide concentration (ppm), (c) nitrogen oxide concentration (ppm), (d) nonmethane 
hydrocarbons (ppm), (e) solar radiation (langleys), (f) aerosols (ruds). 


EXAMPLE B 


Figure 10.7 is a Q-Q plot for groups III and V of Bjerkdahl (see Example A in 
Section 10.2.2). It shows that the difference in the quantiles increases for the larger 
quantiles; this is consistent with the observations we made earlier. From his analysis 
of the data, Bjerkdahl concluded that the increases were proportionally the same for 
animals with little, average, or great resistance—that is, that the treatment effect is 
multiplicative in the sense defined earlier. If this were the case, the Q-Q plot would 
be a straight line. For times up to about 200 days, the animals in group III live 
approximately twice as long as those in group V, but beyond 100 days the difference 
is roughly constant. The Q-Q plot thus provides a simple and effective means of 
comparing the lifetimes in the two groups. 


Further discussion and examples of Q-Q plots can be found in Wilk and 
Gnanadesikan (1968). 
FIGURE 10.7 Q-Q plots of groups III and V from Bjerkdahl (1960). For reference, 
the line y = x has been added. 


Histograms, Density Curves, 
and Stem-and-Leaf Plots 


The histogram, a time-honored method of displaying data, has already been intro- 
duced. It displays the shape of the distribution of data values in the same sense that 
a density function displays probabilities. The range of the data is divided into inter- 
vals, or bins, and the number or proportion of the observations falling in each bin 
is plotted. If the bins are not of equal size, the resulting histogram can be mislead- 
ing. A procedure that is often recommended is to plot the proportion of observations 
falling in the bin divided by the bin width; if this procedure is used, the area under 
the histogram is 1. 
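The recommended scaling is easy to carry out. In the sketch below, the function name, the data, and the bin edges are all hypothetical; the point is only that dividing each proportion by its bin width makes the total area equal to 1 even when the bins are unequal.

```python
def density_histogram(data, edges):
    """Heights of a histogram scaled so that the total area is 1.

    edges: increasing bin boundaries that cover all the data; bins are
    [edge_i, edge_{i+1}), with the last bin closed on the right.
    """
    n = len(data)
    last = len(edges) - 2
    heights = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        count = sum(1 for x in data if lo <= x < hi or (i == last and x == hi))
        heights.append(count / n / (hi - lo))  # proportion divided by bin width
    return heights


values = [1, 2, 2, 3, 5, 8, 13, 21]   # hypothetical data
edges = [0, 2, 4, 8, 24]              # deliberately unequal bin widths
heights = density_histogram(values, edges)
area = sum(h * (b - a) for h, a, b in zip(heights, edges, edges[1:]))
print(heights, area)
```

Plotting raw counts instead would make the wide last bin look as prominent as the second bin, although the data are far less dense there.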

Figure 10.8 shows three histograms of the melting points of beeswax from Ex- 
ample A in Section 10.2.1 with increasingly larger bin width. If the bin width is too 
small, the histogram is too ragged; if the bin is too wide, the shape is oversmoothed 
and obscured. The choice of bin width is usually made subjectively in an attempt to 
strike a balance between a histogram that is too ragged and one that oversmooths. 
Rudemo (1982) discusses automatic methods for choosing the bin width. 

Histograms are frequently used to display data for which there is no assumption 
of any stochastic model—for example, populations of U.S. cities. If the data are 
modeled as a random sample from some continuous distribution, the histogram may 
be viewed as an estimate of the probability density. Regarded in this light, it suffers 
from not being smooth. 

FIGURE 10.8 Histograms of melting points of beeswax: (a) bin width = .1, (b) bin 
width = .2, (c) bin width = .5. 


A smooth probability density estimate can be constructed in the following 
way. Let w(x) be a nonnegative, symmetric weight function, centered at zero and 
integrating to 1. For example, w(x) can be the standard normal density. The function 


$$w_h(x) = \frac{1}{h}\, w\!\left(\frac{x}{h}\right)$$


is a rescaled version of w. As h approaches zero, $w_h$ becomes more concentrated 
and peaked about zero. As h approaches infinity, $w_h$ becomes more spread out and 
flatter. If w(x) is the standard normal density, then $w_h(x)$ is the normal density with 
standard deviation h. If $X_1, \ldots, X_n$ is a sample from a probability density function, 
f, an estimate of f is 


$$f_h(x) = \frac{1}{n}\sum_{i=1}^{n} w_h(x - X_i)$$


This estimate, called a kernel probability density estimate, consists of the superposition 
of “hills” centered on the observations. In the case where w(x) is the standard 
normal density, $w_h(x - X_i)$ is the normal density with mean $X_i$ and standard 
deviation h. 

FIGURE 10.9 Probability density estimates from melting point data. The kernel w is 
the standard normal density with standard deviation (a) .025, (b) .125, and (c) 1.25. 
Note that the vertical scales are different. 


The parameter h, the bandwidth of the estimating function, controls its smoothness 
and corresponds to the bin width of the histogram. If h is too small, the estimate 
is too rough; if it is too large, the shape of f is smeared out too much. Figure 10.9 
shows estimates of the probability density of the melting points of beeswax (from 
Example A in Section 10.2.1) for various values of h. Making a reasonable choice 
of the bandwidth is important, just as is choosing the bin width for a histogram. 
From Figure 10.9, we see that too small a bandwidth yields a ragged curve and too 
large a bandwidth obscures the shape and spreads the probability mass out too much. 
Scott (1992) contains extensive discussion of probability density estimation, includ- 
ing methods for automatic, data-driven bandwidth choice and estimation of densities 
in more than one dimension. 
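A kernel estimate with a normal weight function takes only a few lines of code. The sketch below uses a small hypothetical sample (not the beeswax data) and the three bandwidths of Figure 10.9; it checks numerically that each estimate integrates to about 1 and shows how the height of the highest peak falls as h grows. The function and variable names are assumptions of the sketch.

```python
import math


def standard_normal_density(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kernel_density(x, sample, h, w=standard_normal_density):
    """f_h(x) = (1/n) * sum_i w_h(x - X_i), where w_h(u) = w(u/h) / h."""
    return sum(w((x - xi) / h) / h for xi in sample) / len(sample)


sample = [63.0, 63.3, 63.4, 63.6, 64.2]       # hypothetical values
step = 0.01
grid = [55 + step * i for i in range(1701)]   # 55.00, 55.01, ..., 72.00

areas, peaks = {}, {}
for h in (0.025, 0.125, 1.25):
    values = [kernel_density(x, sample, h) for x in grid]
    areas[h] = sum(values) * step   # Riemann-sum approximation to the integral
    peaks[h] = max(values)
    print(h, round(areas[h], 3), round(peaks[h], 3))
```

With the smallest bandwidth the estimate is essentially five separate spikes; with the largest it is a single broad hump that says little about where the observations lie.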

One disadvantage of a histogram or a probability density estimate is that infor- 
mation is lost; neither allows the reconstruction of the original data. Furthermore, 
a histogram does not allow one to calculate a statistic such as a median; one can 
tell from a histogram only in which bin the median lies and not the median's actual 
value. 

Stem-and-leaf plots (Tukey 1977) convey information about shape while retaining 
the numerical information. It is easiest to define this type of plot by an example, 
a stem-and-leaf plot of the beeswax melting-point data (the decimal point is one 
place to the left of the colon): 


          STEM  LEAF 

  1   1   628 :5 
  1   0   629 : 
  4   3   630 :358 
  7   3   631 :033 
  9   2   632 :77 
 18   9   633 :001446669 
 23   5   634 :01335 
     10   635 :0000113668 
 26   7   636 :0013689 
 19   2   637 :88 
 17   6   638 :334668 
 11   5   639 :22223 
  6   0   640 : 
  6   1   641 :2 
  5   3   642 :147 
  2   0   643 : 
  2   2   644 :02 


The first three digits of the melting points have been selected to form the stem and are 
listed in the third column. The leaves on each stem are the fourth digit of all numbers 
with that stem. For example, the first stem is 628, and its leaf indicates the presence of 
the number 62.85 in the data. The third stem is 630, and its leaves indicate the presence 
of the numbers 63.03, 63.05, and 63.08. This stem-and-leaf plot was constructed by 
a computer, but they are very easy to make by hand. The second column of numbers 
gives the number of leaves on each stem. The first column of numbers facilitates 
finding order statistics, such as quartiles and the median; starting at the top of the 
plot and continuing down to the stem containing the median, the cumulative numbers 
of observations out to the smallest observation are listed. The numbering process 
is then extended symmetrically from the stem containing the median to the largest 
observation of the data. 
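The core of such a display can be produced by a short program. The following is a simplified sketch: it prints the leaf count, the stem, and the leaves, but omits the cumulative counts, and it assumes nonnegative data. The function name and the `leaf_unit` argument are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(data, leaf_unit):
    """Return display lines 'count stem:leaves', one per stem.

    leaf_unit is the place value of the leaf digit (0.01 for data such as 63.03).
    Cumulative counts are omitted to keep the sketch short.
    """
    stems = defaultdict(list)
    for x in sorted(data):
        scaled = round(x / leaf_unit)           # 63.03 -> 6303
        stems[scaled // 10].append(scaled % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{len(leaves):3d} {stem}:{leaves}")
    return lines


# The four smallest melting points named in the discussion of the plot.
for line in stem_and_leaf([63.08, 62.85, 63.05, 63.03], leaf_unit=0.01):
    print(line)
```

Stems with no observations are still printed, so that gaps in the data remain visible in the display.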

Straightforward stem-and-leaf plots do not work well for data that range over 
several orders of magnitude. In such a situation, it is better to make a stem-and-leaf 
plot of the logarithms of the data. 


Measures of Location 


Sections 10.2 and 10.3 were concerned with data analogues of the cumulative distri- 
bution and density functions and with related curves, which convey visual information 
about the shape of the distribution of the data. Here and in Section 10.5, we discuss 
simple numerical summaries of data that are useful when there is not enough data 
to justify constructing a histogram or an ecdf, or when a more concise summary is 
desired. 

A measure of location is a measure of the center of a batch of numbers. If 
the numbers result from different measurements of the same quantity, a measure of 
location is often used in the hope that it is more accurate than any single measurement. 
In other situations, a measure of location is used as a simple summary of the 
numbers—for example, “the average grade on the exam was 72.” In this section, we 
will discuss several common measures of location and their relative advantages and 
disadvantages. 


The Arithmetic Mean 


The most commonly used measure of location is the arithmetic mean, 


$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$


For illustration, we consider a set of 26 measurements of the heat of sublimation of 
platinum from an experiment done by Hampson and Walker (1961). The data are 
listed here: 


Heats of Sublimation of Platinum (kcal/mol) 





136.3 136.6 135.8 135.4 134.7 135.0 134.1 143.3 
147.8 148.8 134.8 135.2 134.9 146.5 141.2 135.4 
134.8 135.8 135.0 133.7 134.4 134.9 134.8 134.5 
134.3 135.2 





The 26 measurements are all attempts to measure the “true” heat of sublimation, 
and we see that there is variability among them. Intuitively, it may seem that a measure 
of location or center for this batch of numbers would give a more accurate estimate 
of the heat of sublimation than any one of the numbers alone. 

A common statistical model for the variability of a measurement process is the 
following: 


$$X_i = \mu + \beta + \varepsilon_i$$


(See Section 4.2.1.) Here, $X_i$ is the value of the ith measurement, μ is the true value 
of the heat of sublimation, β represents bias in the measurement procedure, and $\varepsilon_i$ 
is the random error. The $\varepsilon_i$ are usually assumed to be independent and identically 
distributed random variables with mean 0 and variance $\sigma^2$. The efficacy of measures 
of location is often judged by comparing their performances (mean squared error, for 
example) with this model. Note that with this model the data alone tell us nothing 
about β, the bias in the measurement procedure, which in some cases may be as important 
as, or more important than, the random variability. 

FIGURE 10.10 Plot showing time sequence of measurements of heat of 
sublimation of platinum. 


The observations are listed across rows in the order in which the experiments were 
done. When observations are acquired sequentially, it is often informative to plot them 
in order, as in Figure 10.10. From this plot, we see that the first few observations were 
somewhat high. The most striking aspect of the plot is the presence of five extreme 
observations that occurred in groups of three and two. Such observations, which are 
quite far from the bulk of the data, are called outliers. Outliers occur all too frequently, 
even in carefully conducted studies. The outliers in this case might have been caused 
by improperly calibrated equipment, for example. Outliers can also be caused by 
recording and transcription errors or by equipment malfunctions. It is important to 
detect outliers, since they may have an undue influence on subsequent calculations. 
Graphical presentation is an effective means of detection. Careful reexamination of the 
data and the circumstances under which they were obtained can sometimes uncover 
the causes behind the outliers. Although outliers are often unexplainable aberrations, 
an examination of them and their causes can sometimes deepen an investigator's 
understanding of the phenomenon under study. 

Figure 10.10 also makes us doubt that the model for measurement error given 
above is appropriate for this set of data. The fact that the outliers occur in groups of 
two and three, rather than being randomly scattered, makes the independence model 
somewhat implausible. 

A stem-and-leaf plot provides another summary of this data (the decimal point 
is at the colon): 


  1   1   133:7 
  4   3   134:134 
 11   7   134:5788899 
      6   135:002244 
  9   2   135:88 
  7   1   136:3 
  6   1   136:6 


High: 141.2 143.3 146.5 147.8 148.8 
On this stem-and-leaf plot, the outlying observations have been isolated and 
flagged as high. 

In their analysis, Hampson and Walker set aside the seven largest observations 
and the smallest observation and found the average of the remaining observations to 
be 134.9. Calculated from all the observations, the arithmetic mean is 137.05. Note 
from the stem-and-leaf plot and from Figure 10.10 that this number is larger than the 
bulk of the data and is clearly not a good descriptive measure of the "center" of this 
batch of numbers. We would not be satisfied with it as an estimate of the true heat of 
sublimation. 

If the data are modeled as a sample from a probability law, as with the measurement 
error model described above, an approximate 100(1 − α)% confidence interval 
for the population mean can be obtained from the central limit theorem as in Chapter 7. 
The interval is of the form 


$$\bar{x} \pm z(\alpha/2)\, s_{\bar{x}}$$


Blindly applying this formula to the platinum data, with α = .05, we obtain the 
interval 137.05 ± 1.71, or (135.3, 138.8). Note where this interval falls on the 
stem-and-leaf plot! 
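The calculation is easily reproduced. The sketch below recomputes the mean and the normal-theory interval from the 26 listed measurements, taking the standard error of the mean to be s/√n; the final lines, which replace one measurement by a wildly wrong value, are a hypothetical illustration of how a single number can move the mean.

```python
import math
import statistics

platinum = [
    136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3,
    147.8, 148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4,
    134.8, 135.8, 135.0, 133.7, 134.4, 134.9, 134.8, 134.5,
    134.3, 135.2,
]

n = len(platinum)
mean = statistics.fmean(platinum)
std_error = statistics.stdev(platinum) / math.sqrt(n)   # s / sqrt(n)
z = statistics.NormalDist().inv_cdf(1 - 0.05 / 2)        # z(alpha/2) for alpha = .05
half_width = z * std_error
print(round(mean, 2), round(half_width, 2))
print(round(mean - half_width, 1), round(mean + half_width, 1))

# Hypothetical corruption: the last value is entered with a misplaced decimal point.
corrupted = platinum[:-1] + [13520.0]
print(round(statistics.fmean(corrupted), 2))
```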

Although the example presented here may be somewhat extreme, it illustrates the 
sensitivity of the sample mean to outlying observations. In fact, by changing a single 
number, the arithmetic mean of a batch of numbers can be made arbitrarily large or 
small. Thus, if used blindly, without careful attention to the data, the arithmetic mean 
can produce misleading results. When the data are automatically acquired, stored 
as files on disks or tapes, and not visually examined, this danger increases. For this 
reason, measures of location that are robust, or insensitive to outliers, are important. 





The Median 


If the sample size is an odd number, the median is defined to be the middle value 
of the ordered observations; if the sample size is even, the median is the average of 
the two middle values. Clearly, moving the extreme observations does not affect the 
sample median at all, so the median is quite robust. The median of the platinum data 
is 135.1, which, as can be seen from the stem-and-leaf plot, is more reasonable than 
the mean as a measure of the center. 

When the data are a sample from a continuous probability law, the sample median 
can be viewed as an estimate of the population median, η, for which a simple 
confidence interval can be formed. We will now demonstrate that this interval is of 
the form 


$$(X_{(k)},\ X_{(n-k+1)})$$

The coverage probability of this interval is 


$$\begin{aligned}
P(X_{(k)} \le \eta \le X_{(n-k+1)}) &= 1 - P(\eta < X_{(k)} \text{ or } \eta > X_{(n-k+1)}) \\
&= 1 - P(\eta < X_{(k)}) - P(\eta > X_{(n-k+1)})
\end{aligned}$$
since the events are mutually exclusive. To evaluate these terms, we first note that 


$$\begin{aligned}
P(\eta > X_{(n-k+1)}) &= \sum_{j=0}^{k-1} P(j \text{ observations are greater than } \eta) \\
P(\eta < X_{(k)}) &= \sum_{j=0}^{k-1} P(j \text{ observations are less than } \eta)
\end{aligned}$$


Since, by definition, the median satisfies 

$$P(X_i > \eta) = P(X_i < \eta) = \tfrac{1}{2}$$


and since the n observations $X_1, \ldots, X_n$ are independent and identically distributed, 
the distribution of the number of observations greater than the median is binomial 
with n trials and probability 1/2 of success on each trial. Thus, 


$$P(\text{exactly } j \text{ observations are greater than } \eta) = \frac{1}{2^n}\binom{n}{j}$$


and 


$$P(\eta > X_{(n-k+1)}) = \frac{1}{2^n}\sum_{j=0}^{k-1}\binom{n}{j}$$


From symmetry, we then have that the coverage probability of the interval in question 
is 

$$1 - \frac{1}{2^{n-1}}\sum_{j=0}^{k-1}\binom{n}{j}$$


These probabilities can be found from tables of the cumulative binomial distribution 
since 

$$\frac{1}{2^n}\sum_{j=0}^{k-1}\binom{n}{j} = P(Y \le k - 1)$$


where Y is a binomial random variable with n trials and probability of success equal 
to 1/2. 


As a concrete example, with n = 26, we have the following cumulative binomial 
probabilities: 





k    P(Y ≤ k) 
5    .0012 
6    .0047 
7    .0145 
8    .0378 
9    .0843 
If we choose k = 8, 
P(Y < k) = .0145 


and since P(Y < k) = P(Y ≥ n − k + 1), P(Y ≥ 19) = .0145. Since 2 × .0145 = 
.029, the interval $(X_{(8)}, X_{(19)})$ is a 97% confidence interval. Note that this confidence 
interval is exact, not approximate, and does not depend on the form of the underlying 
cdf but only on the assumption that the cdf is continuous and that the observations 
are independent. 

For the platinum data, this confidence interval is (134.8, 135.8). Compare this 
interval to the interval based on the sample mean. (But as we noted, there is reason 
to doubt the independence assumption for the platinum data, so these calculations 
should be viewed as an illustrative numerical exercise.) 
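The interval and its exact coverage can be computed without tables. The sketch below evaluates the binomial sum with `math.comb` and reads off the order statistics; the function name is an assumption, and the data are the platinum measurements listed earlier.

```python
import math

platinum = [
    136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3,
    147.8, 148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4,
    134.8, 135.8, 135.0, 133.7, 134.4, 134.9, 134.8, 134.5,
    134.3, 135.2,
]


def median_confidence_interval(data, k):
    """Return the interval (X_(k), X_(n-k+1)) and its exact coverage probability."""
    xs = sorted(data)
    n = len(xs)
    tail = sum(math.comb(n, j) for j in range(k)) / 2 ** n   # P(Y <= k - 1)
    return (xs[k - 1], xs[n - k]), 1 - 2 * tail


interval, coverage = median_confidence_interval(platinum, k=8)
print(interval, round(coverage, 3))
```

Trying neighboring values of k shows how coarse the available coverage levels are for n = 26: the level can only be chosen from the discrete set that the binomial distribution allows.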


The Trimmed Mean 


Another simple and robust measure of location is the trimmed mean. The 100α% 
trimmed mean is easy to calculate: Order the data, discard the lowest 100α% and the 
highest 100α%, and take the arithmetic mean of the remaining data. It is generally 
recommended that the value chosen for α be from .1 to .2. Formally, we may write 
the trimmed mean as 


$$\bar{x}_\alpha = \frac{x_{([n\alpha]+1)} + \cdots + x_{(n-[n\alpha])}}{n - 2[n\alpha]}$$


where [nα] denotes the greatest integer less than or equal to nα. Note that the median 
can be regarded as a 50% trimmed mean. 

The 20% trimmed mean for the platinum data listed in Section 10.4.1 is formed 
by discarding the highest and lowest five observations (.2 × 26 = 5.2) and averaging 
the rest. The result is 135.29; for the same data, the median was 135.1 and the mean 
was 137.05. 
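The trimmed mean translates directly into code. In this sketch the function name is an assumption, and a tiny constant is added before taking the integer part to guard against floating-point products such as .29 × 100 falling just below a whole number.

```python
import math

platinum = [
    136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3,
    147.8, 148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4,
    134.8, 135.8, 135.0, 133.7, 134.4, 134.9, 134.8, 134.5,
    134.3, 135.2,
]


def trimmed_mean(data, alpha):
    """100*alpha% trimmed mean: drop the [n*alpha] smallest and largest values."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.5")
    xs = sorted(data)
    n = len(xs)
    g = math.floor(n * alpha + 1e-9)   # [n*alpha], with a guard against rounding error
    kept = xs[g:n - g]
    return sum(kept) / len(kept)


print(round(trimmed_mean(platinum, 0.2), 2))   # five values dropped from each end
print(round(trimmed_mean(platinum, 0.0), 2))   # no trimming: the ordinary mean
```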


M Estimates 


The sample mean is the mle of μ, the location parameter, when the underlying distribution 
is normal. Equivalently, the sample mean minimizes the negative log likelihood, 
or 


$$\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^2$$

This is the simplest case of a least squares estimate. (We will discuss least squares 
estimates in more detail in the context of curve fitting.) Outliers have a great effect 
on this estimate, since the deviation of μ from $X_i$ is measured by the square of 
their difference. In contrast, the median is the minimizer of (see Problem 34 of the 
end-of-chapter problems) 


$$\sum_{i=1}^{n}\left|\frac{X_i - \mu}{\sigma}\right|$$

Here, large deviations are not weighted as heavily, and it is this property that causes 
the median to be robust. 

Huber (1981) proposed a class of estimates, M estimates, which are the minimizers 
of 


$$\sum_{i=1}^{n}\Psi\left(\frac{X_i - \mu}{\sigma}\right)$$


where the weight function Ψ is a compromise between the weight functions for the 
mean and the median. A wide variety of weight functions have been proposed. Huber 
discusses weight functions that are quadratic near zero and are linear beyond a cutoff 
point, k. Thus, k → ∞ corresponds to the mean and k → 0 to the median. A common 
choice is k = 1.5. With this choice, the influence of observations more than 1.5σ 
away from the center is reduced. In practice, a robust estimate of σ, such as those 
discussed in Section 10.5, must be used. 

The computation of an M estimate is a nonlinear minimization problem and must 
be done iteratively (using the Newton-Raphson method, for example). If Ψ is a convex 
function, the minimizer will be unique. Fairly simple computer programs that do this 
are common in statistical packages. The M estimate (k = 1.5) for the platinum data 
we have been considering is 135.38, close to the median (135.1) and the trimmed 
mean (135.29) but quite different from the mean (137.05). 
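A minimal sketch of such a computation follows. It is not necessarily the program used for the figure quoted above: it assumes that σ is estimated by the median absolute deviation divided by .6745 (one common robust choice), and it uses a simple fixed-point iteration on the estimating equation rather than Newton-Raphson. Under these assumptions it gives a value of about 135.38 for the platinum data.

```python
import statistics

platinum = [
    136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3,
    147.8, 148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4,
    134.8, 135.8, 135.0, 133.7, 134.4, 134.9, 134.8, 134.5,
    134.3, 135.2,
]


def huber_location(data, k=1.5, tol=1e-10, max_iter=1000):
    """Huber M estimate of location with a fixed robust scale.

    Assumptions of this sketch: the scale is MAD / 0.6745, and the estimate
    solves sum_i psi((x_i - mu) / scale) = 0, where psi clips its argument
    to [-k, k] (the derivative of a quadratic-then-linear weight function).
    """
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    scale = mad / 0.6745
    mu = med                                   # start from the median
    for _ in range(max_iter):
        psi_sum = sum(max(-k, min(k, (x - mu) / scale)) for x in data)
        step = scale * psi_sum / len(data)     # fixed-point update
        mu += step
        if abs(step) < tol:
            break
    return mu


print(round(huber_location(platinum), 2))
```

Observations within 1.5 scale units of the current estimate contribute their full deviation, as for the mean; those farther away contribute a constant amount, as for the median, which is what limits the pull of the five outliers.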


Comparison of Location Estimates 


We introduced several location estimates (and there are many others). Which one 
is best? There is no simple answer to this question. It is always important to bear 
in mind what is being estimated by the location estimates and to what purpose the 
estimate is being put. If the underlying distribution is symmetric, the trimmed mean, 
the sample mean, the sample median, and an M estimate all estimate the center of 
symmetry. If the underlying distribution is not symmetric, however, the four statistics 
estimate four different population parameters: the population mean, the population 
median, the population trimmed mean, and a functional of the cdf determined by the 
weight function Ψ. Moreover, there is no single estimate that is best for all symmetric 
distributions. Life isn't that simple. Simulations have been done to compare estimates 
for a variety of distributions. Andrews et al. (1972) report the results of a large number 
of simulations from symmetric distributions. Their results show that the 10% or 20% 
trimmed mean is overall quite an effective estimate: Its variance is never much larger 
than the variance of the ordinary mean (even in the Gaussian case for which the 
mean is optimal) and can be quite a lot smaller when the underlying distribution 
is heavy-tailed relative to the Gaussian. The median, although quite robust, has a 
substantially larger variance in the Gaussian case than does the trimmed mean. The 
trimmed mean and the median have a certain appealing simplicity and are easy to 
explain to someone who has little formal statistical training. M estimates performed 
quite well in the simulations of the Andrews et al. study, and they do generalize 
more naturally to other problems such as curve fitting. But they are somewhat more 
difficult to compute and have less immediate intuitive appeal. For the purpose of 
simply summarizing data, it is often useful to compute more than one measure of 
location and compare the results. 
Estimating Variability of Location Estimates 
by the Bootstrap 


If we view the observations $x_1, x_2, \ldots, x_n$ as realizations of independent random 
variables with common distribution function F, it is appropriate to investigate the 
variability and sampling distribution of a location estimate calculated from a sample 
of size n. Suppose we denote the location estimate as $\hat\theta$; it is important to keep in 
mind that $\hat\theta$ is a function of the random variables $X_1, X_2, \ldots, X_n$ and hence has a 
probability distribution, its sampling distribution, which is determined by n and F. We 
would like to know this sampling distribution, but we are faced with two problems: 
(1) We don't know F, and (2) even if we knew F, $\hat\theta$ may be such a complicated 
function of $X_1, X_2, \ldots, X_n$ that finding its distribution would exceed our analytic 
abilities. 

First, we address the second problem. Suppose then, for the moment, that we
knew F. How could we find the probability distribution of $\hat\theta$ without going through
incredibly complicated analytic calculations? The computer comes to our rescue: we
can do it by simulation. We generate many, many samples, say B in number, of size
n from F; from each sample we calculate the value of $\hat\theta$. The empirical distribution
of the resulting values $\theta_1^*, \theta_2^*, \ldots, \theta_B^*$ is an approximation to the distribution function
of $\hat\theta$, which is good if B is very large. If we want to know the standard deviation of
$\hat\theta$, we can find a good approximation to it by calculating the standard deviation of
the collection of values $\theta_1^*, \theta_2^*, \ldots, \theta_B^*$. We can make these approximations arbitrarily
accurate by taking B to be arbitrarily large.
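
The simulation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: F is taken to be the standard normal distribution, the estimate is a 20% trimmed mean, n = 26, and B = 2000; the helper names `trimmed_mean` and `simulate_sampling_distribution` are hypothetical and do not come from the text.

```python
import random
import statistics


def trimmed_mean(xs, alpha=0.2):
    """Mean after discarding the int(n * alpha) smallest and largest values."""
    ordered = sorted(xs)
    k = int(len(ordered) * alpha)
    return statistics.fmean(ordered[k:len(ordered) - k])


def simulate_sampling_distribution(draw, estimator, n, B, seed=0):
    """Draw B samples of size n from a known F and return the B estimates."""
    rng = random.Random(seed)
    return [estimator([draw(rng) for _ in range(n)]) for _ in range(B)]


# Assumed F: standard normal. Any sampler taking an rng could be substituted.
values = simulate_sampling_distribution(
    lambda rng: rng.gauss(0.0, 1.0), trimmed_mean, n=26, B=2000
)

# The standard deviation of the simulated values approximates the
# standard deviation of the estimate under F.
sd_hat = statistics.pstdev(values)
print(round(sd_hat, 3))
```

Increasing B reduces the Monte Carlo error of the approximation; it does not change the fact that the answer depends on the assumed F.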

All this would be well and good if we knew F, but we don't. So what do we do?
The bootstrap solution is to view the empirical cdf $F_n$ as an approximation to F and
sample from $F_n$. That is, $F_n$ would be used in place of F in the previous paragraph.
How do we go about sampling from $F_n$? $F_n$ is a discrete probability distribution
that gives probability 1/n to each observed value $x_1, x_2, \ldots, x_n$. A sample of size
n from $F_n$ is thus a sample of size n drawn with replacement from the collection
$x_1, x_2, \ldots, x_n$. We thus draw B samples of size n with replacement from the observed
data, producing $\theta_1^*, \theta_2^*, \ldots, \theta_B^*$. The standard deviation of $\hat\theta$ is then estimated by

$$s_{\hat\theta} = \sqrt{\frac{1}{B} \sum_{i=1}^{B} \left(\theta_i^* - \bar\theta^*\right)^2}$$

where $\bar\theta^*$ is the mean of $\theta_1^*, \theta_2^*, \ldots, \theta_B^*$.
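
A minimal sketch of the bootstrap standard error follows, using only the Python standard library. The data values are made up for illustration (they are not the platinum measurements), and the function names `bootstrap_replicates` and `bootstrap_se` are hypothetical. The divisor B is an assumption matching the formula as reconstructed here; some authors use B - 1.

```python
import math
import random
import statistics


def trimmed_mean(xs, alpha=0.2):
    """Mean after discarding the int(n * alpha) smallest and largest values."""
    ordered = sorted(xs)
    k = int(len(ordered) * alpha)
    return statistics.fmean(ordered[k:len(ordered) - k])


def bootstrap_replicates(data, estimator, B=1000, seed=0):
    """Apply the estimator to B samples of size n drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [estimator(rng.choices(data, k=n)) for _ in range(B)]


def bootstrap_se(replicates):
    """Standard deviation of the bootstrap replicates, with divisor B."""
    B = len(replicates)
    mean_star = sum(replicates) / B
    return math.sqrt(sum((t - mean_star) ** 2 for t in replicates) / B)


# Hypothetical data with two large values, for illustration only.
data = [9.8, 10.1, 10.0, 9.9, 10.3, 10.2, 9.7, 10.0, 14.5, 15.2]
replicates = bootstrap_replicates(data, trimmed_mean, B=1000)
print(round(bootstrap_se(replicates), 3))
```

Passing a different estimator, such as `statistics.median`, gives the bootstrap standard error of that statistic with no other change.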


We illustrate this idea on the platinum data by using the bootstrap to approximate the 
sampling distribution of the 20% trimmed mean and its standard error. To this end, 
1000 samples of size n = 26 were drawn randomly with replacement from the collec- 
tion of 26 values. A histogram of the 1000 trimmed means is displayed in Figure 10.11. 
The standard deviation of the 1000 values was .64, which is the estimated standard 
error of the 20% trimmed mean. The histogram is interesting—note the skewed tail to 
the right. We see that some of the trimmed means were far from the bulk of the data. 
This happened because some of the samples drawn with replacement included several
replicates of the five outliers (see Figure 10.10). The computer calculation is telling
us that if we sample from $F_n$, the 20% trimmed mean is not as robust as we might
like; this is an extremely heavy-tailed distribution and a sample of 26 may contain a
large number of outliers.

FIGURE 10.11 Histogram of 1000 bootstrap 20% trimmed means.

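As in Chapter 8, we can use the bootstrap distribution to form an approximate
90% confidence interval. We proceed as in Examples D and E of Section 8.5.3, which
you may want to review at this time. Denote the trimmed mean of the sample by
$\hat\theta = 135.29$, and denote the 1000 ordered bootstrap trimmed means by
$\theta_{(1)}^* \le \theta_{(2)}^* \le \cdots \le \theta_{(1000)}^*$. Then the .05 quantile of the bootstrap distribution is
$\underline{\theta} = \theta_{(50)}^* = 134.00$, and the .95 quantile is $\bar\theta = \theta_{(950)}^* = 136.93$. Following the
notation of the examples of Section 8.5.3, the approximate 90% confidence interval is
$(\hat\theta - \bar\delta, \hat\theta - \underline{\delta})$, where

$$\hat\theta - \bar\delta = \hat\theta - (\bar\theta - \hat\theta) = 2\hat\theta - \bar\theta = 133.65$$

and

$$\hat\theta - \underline{\delta} = \hat\theta - (\underline{\theta} - \hat\theta) = 2\hat\theta - \underline{\theta} = 135.58$$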
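
The interval construction above can be sketched as a small Python function. The function name `basic_bootstrap_interval` and the simulated replicates are hypothetical; the ranks are chosen so that B = 1000 and a 90% level pick out the 50th and 950th ordered replicates, as in the example.

```python
import random


def basic_bootstrap_interval(theta_hat, replicates, level=0.90):
    """Return (2*theta_hat - upper quantile, 2*theta_hat - lower quantile)."""
    ordered = sorted(replicates)
    B = len(ordered)
    alpha = (1.0 - level) / 2.0
    lo_rank = max(1, round(alpha * B))           # 50 when B = 1000, level = 0.90
    hi_rank = min(B, round((1.0 - alpha) * B))   # 950 when B = 1000, level = 0.90
    theta_lower = ordered[lo_rank - 1]           # lower bootstrap quantile
    theta_upper = ordered[hi_rank - 1]           # upper bootstrap quantile
    return (2 * theta_hat - theta_upper, 2 * theta_hat - theta_lower)


# Illustration with made-up replicates centered near a made-up estimate.
rng = random.Random(0)
theta_hat = 10.0
replicates = [theta_hat + rng.gauss(0.0, 0.5) for _ in range(1000)]
print(basic_bootstrap_interval(theta_hat, replicates))
```

Note that the upper bootstrap quantile determines the lower endpoint and vice versa, so a bootstrap distribution with a long right tail pushes the lower confidence limit down.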
FIGURE 10.12 Histogram of 1000 bootstrap medians.


Figure 10.12 is a histogram of 1000 bootstrapped medians. It is less dispersed
than the histogram of trimmed means; the standard deviation of the medians is .24,
considerably less than that of the trimmed mean. The bootstrap simulation is telling
us that when sampling from a distribution like this, the median is more robust than
the 20% trimmed mean.


How accurate are these bootstrap estimates? It is difficult to answer this question
in a useful, explicit manner. Essentially, the accuracy depends on two factors: (1) the
accuracy of $F_n$ as an estimate of F, and (2) the dependence of the distribution of the
statistic $\hat\theta$ on F. For example, if the distribution of $\hat\theta$ changes little if F changes, then
$F_n$ need not be a very good estimate of F, whereas if the distribution of $\hat\theta$ is extremely
sensitive to F, $F_n$ will have to be a good estimate of F, and hence the sample size
will have to be large, in order for the bootstrap approximation to be accurate.


Measures of Dispersion 


A measure of dispersion, or scale, gives a numerical indication of the "scatteredness" 
of a batch of numbers. Simple summaries of data often consist of a measure of location 
and a measure of dispersion. The most commonly used measure is the sample standard 
deviation, s, which is the square root of the sample variance,

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Using n - 1 as the divisor rather than the more obvious divisor n is based on the
rationale that $s^2$ is an unbiased estimate of the population variance if the observations
are independent and identically distributed with variance $\sigma^2$. (But s is not an unbiased
estimate of $\sigma$ because the square root is a nonlinear function.) If n is of moderate to
large size, it makes little difference whether n or n - 1 is used.

If the observations are a sample from a normal distribution with variance $\sigma^2$,

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$

This distributional result may be used to construct confidence intervals for $\sigma^2$ in the
normal case (compare with Example A in Section 8.5.3), but the result is not robust
against deviations from normality.

Like the sample mean, the sample standard deviation is sensitive to outlying
observations. Two simple robust measures of dispersion are the interquartile range
(IQR), or the difference between the two sample quartiles (the 25th and 75th
percentiles), and the median absolute deviation from the median (MAD). If the data
are $x_1, \ldots, x_n$ with median $\tilde{x}$, the MAD is defined to be the median of the numbers
$|x_i - \tilde{x}|$. These two measures of dispersion, the IQR and the MAD, can be converted
into estimates of $\sigma$ for a normal distribution by dividing them by 1.35 and .675,
respectively. David (1981) discusses a method for finding a confidence interval for
the population interquartile range, using reasoning similar to that used in Section
10.4.2 for developing a confidence interval for the population median.
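
As a rough check of the divisors 1.35 and .675, the following sketch computes the IQR and the MAD for a simulated normal sample and then adds a few gross outliers. Everything here is an illustrative assumption: the simulated data, the value sigma = 2, and the helper names `iqr` and `mad`. Sample quartiles are computed with `statistics.quantiles`, whose default interpolation rule is only one of several conventions in use.

```python
import random
import statistics


def iqr(xs):
    """Difference between the upper and lower sample quartiles."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1


def mad(xs):
    """Median of the absolute deviations from the sample median."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs])


rng = random.Random(0)
sample = [rng.gauss(0.0, 2.0) for _ in range(20000)]   # assumed sigma = 2

sigma_from_iqr = iqr(sample) / 1.35
sigma_from_mad = mad(sample) / 0.675
print(round(sigma_from_iqr, 2), round(sigma_from_mad, 2))

# A handful of gross outliers inflates s but barely moves the robust measures.
contaminated = sample + [1000.0] * 20
print(round(statistics.stdev(contaminated), 2),
      round(mad(contaminated) / 0.675, 2))
```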

Let us compare all three measures of dispersion for the platinum data: 


s —445 
IOR 

TOR _ 16 
1.35 
MAD 
DES Fog 
695 


The two robust estimates are similar. From the stem-and-leaf plot of the platinum 
values presented earlier, we can see that both the IQR and the MAD give measures of 
the spread of the central portion of the data, whereas the standard deviation is heavily 
influenced by the outliers. 


Boxplots 


A boxplot is a graphical display invented by Tukey that shows a measure of location 
(the median), a measure of dispersion (the interquartile range), and the presence of 
possible outliers and also gives an indication of the symmetry or skewness of the 
distribution. Figure 10.13 is a boxplot of the platinum data. 
FIGURE 10.13 Boxplot of the platinum data.


We outline the construction of a boxplot: 


1. Horizontal lines are drawn at the median and at the upper and lower quartiles 
and are joined by vertical lines to produce the box. 

2. A vertical line is drawn up from the upper quartile to the most extreme data point 
that is within a distance of 1.5 (IQR) of the upper quartile. A similarly defined 
vertical line is drawn down from the lower quartile. Short horizontal lines are 
added to mark the ends of these vertical lines. 

3. Each data point beyond the ends of the vertical lines is marked with an asterisk 
or dot (* or ·).
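
The three construction steps can be turned into numbers before anything is drawn. The sketch below, with the hypothetical helper `boxplot_summary` and made-up data, computes the quartiles, the ends of the vertical lines, and the points that would be marked individually. It assumes the quartile convention of `statistics.quantiles`; other software may place the quartiles slightly differently.

```python
import statistics


def boxplot_summary(xs):
    """Quantities needed to draw a boxplot as outlined in steps 1 to 3."""
    data = sorted(xs)
    q1, med, q3 = statistics.quantiles(data, n=4)
    reach = 1.5 * (q3 - q1)                      # 1.5 (IQR)
    inside = [x for x in data if q1 - reach <= x <= q3 + reach]
    outliers = [x for x in data if x < q1 - reach or x > q3 + reach]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],     # most extreme point within 1.5 IQR below q1
        "whisker_high": inside[-1],   # most extreme point within 1.5 IQR above q3
        "outliers": outliers,         # points marked individually
    }


# Made-up batch with one clearly outlying value.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary)
```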


Boxplots are not uniformly standardized, but the basic structure is as outlined 
above, perhaps with additional embellishments or small variations. A boxplot thus 
gives an indication of the center of the data (the median), the spread of the data 
(the interquartile range), and the presence of outliers, and indicates the symmetry or 
asymmetry of the distribution of data values (the location of the median relative to the 
quartiles). In Figure 10.13, the five outliers of the platinum data are clearly displayed, 
and we see an indication that the central part of the distribution is somewhat skewed 
toward high values. 


Figure 10.14 is taken from Chambers et al. (1983). The data plotted are daily maximum 
concentrations in parts per billion of sulfur dioxide in Bayonne, N.J., from November 
1969 to October 1972 grouped by month. There are thus 36 batches, each of size 
about 30. The investigators concluded: 


The boxplots ... show many properties of the data rather strikingly. There is 
a general reduction in sulphur dioxide concentration through time due to the 
gradual conversion to low sulphur fuels in the region. The decline is most 
dramatic for the highest quantiles. Also, there are higher concentrations 
during the winter months due to the use of heating oil. In addition, the
boxplots show that the distributions are skewed toward high values and
that the spread of the distributions ... is larger when the general level of
concentration is higher.

FIGURE 10.14 Boxplots of daily maximum concentrations of sulfur dioxide.


The boxplot is clearly a very effective method of presenting and summarizing 
these data. As they are in this example, boxplots are generally useful for comparing 
batches of numbers, a purpose to which they will be put in the next two chapters.


Exploring Relationships with Scatterplots 


Many interesting questions in statistics involve trying to understand the relationships
among variables. The scatterplot is a basic method for displaying the empirical
relationship between two variables based on a collection of pairs $(x_i, y_i)$: one merely
plots the points in the xy plane. This basic display can be augmented in various ways,
as we will illustrate with some examples.


Allison and Cicchetti (1976) examined the relationships of possible correlates of sleep 
behavior in mammals. Figure 10.15 is a scatterplot of total sleep versus brain weight. 
Other than that two mammals with very large brains slept very little, no relationships 
are apparent in the plot. There is in fact a relationship, but it is obscured in the plot
because brain weights vary over orders of magnitude: the brain of the lesser short-tailed
shrew weighs 0.14 grams, and at the other extreme the brain of the African
elephant weighs 5,712 grams. It is thus much more informative to plot sleep versus
the logarithm of brain weight, and annotating the plot helps further, as shown in
Figure 10.16. It is now clear that mammals with heavier brains tend to sleep less.
Data on these and other variables (how much do elephants dream?) can be found at
http://lib.stat.cmu.edu/datasets/sleep.

FIGURE 10.15 Sleep versus brain weight for a collection of mammals.

FIGURE 10.16 Sleep versus logarithm of brain weight.


Correlation coefficients are often used as a simple numerical summary of the 
strength of a relationship. The Pearson correlation coefficient corresponding to the 
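pairs $(x_i, y_i)$ is

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
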

This statistic measures the strength of a linear relationship. The correlation of brain 
weight and sleep is —0.36, and the correlation between the logarithm of brain weight 
and sleep is —0.56. These are different because a nonlinear transformation has been 
applied and the correlation coefficient measures the strength of a linear relationship. 
An alternative to the Pearson correlation coefficient is the rank correlation
coefficient: the brain weights are replaced by their ordered ranks (1, 2, ...), the
sleeping times are replaced by their ranks, and then the Pearson correlation coefficient
of the pairs of ranks is computed. The rank correlation turns out to be -0.39 in
our example. Some advantages of the rank correlation coefficient are that it is
insensitive to outliers and is invariant under any monotone transformation (thus the
rank correlation does not depend on whether brain weight or log brain weight is
used).
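
The invariance of the rank correlation under a monotone transformation can be checked numerically. The sketch below uses five made-up (weight, sleep) pairs, not the Allison and Cicchetti data, and the helper names `pearson`, `ranks`, and `rank_correlation` are hypothetical. Ties, if any, receive the average of the ranks they occupy, which is one common convention.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of the pairs (x_i, y_i)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(xs):
    """Ranks 1..n of the values, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rank_correlation(xs, ys):
    """Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up values spanning several orders of magnitude in x.
weight = [0.14, 2.5, 40.0, 180.0, 5712.0]
sleep = [15.0, 13.0, 10.5, 11.0, 3.5]
log_weight = [math.log(w) for w in weight]

print(pearson(weight, sleep), pearson(log_weight, sleep))                      # differ
print(rank_correlation(weight, sleep), rank_correlation(log_weight, sleep))    # agree
```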

Arrays of scatterplots are useful for examining the relationships among more 
than two variables, as illustrated in the following example. 


Inductive loop detectors are wire loops embedded in the pavement of roadways. They
operate by detecting the change in inductance caused by the metal in vehicles that
pass over them. During successive intervals of time, a detector reports the number
of passing vehicles and the percentage of time that it was covered by a vehicle. The
number of vehicles is called the flow; the percentage of coverage is called the occupancy.
Such detectors are widely used to measure freeway traffic but are subject to various
kinds of malfunction. Faulty detectors must be identified by traffic management
centers. One key to detecting malfunction is knowing that measurements in the several
freeway lanes at a particular location should be highly related: the increases and
decreases of traffic flow in one lane should tend to be mirrored in other lanes.
Figure 10.17 shows an array of scatterplots of occupancy measured by detectors in four
lanes at a particular location (Bickel et al. 2004). The detectors in lanes 3 and
4 were closely related to each other at all times and were correlated with measurements
in lanes 1 and 2 some, but not all, of the time. Apparently the detectors
in lanes 1 and 2 malfunctioned some of the time while this set of measurements was
taken.

FIGURE 10.17 Occupancy measurements by adjacent loops in four lanes.


Concluding Remarks 


This chapter introduced several tools for summarizing data, some of which are graph- 
ical in nature. Under the assumption of a stochastic model for the data, some aspects 
of the sampling distributions of these summaries have been discussed. Summaries are 
very important in practice; an intelligent summary of data is often sufficient to fulfill 
the purposes for which the data were gathered, and more formal techniques such 
as confidence intervals or hypothesis tests sometimes add little to an investigator's 
understanding. Effective summaries can also point to “bad” data or to unexpected 
aspects of data that might have gone unnoticed if the data had been blindly crunched 
by a computer. 

We saw the bootstrap appear again as a method for approximating a sampling 
distribution and functionals of it such as its standard deviation. The bootstrap, a 
relatively recent development in statistical methodology, relies on the availability of 
powerful and inexpensive computing resources. Our development of approximate 
confidence intervals based on the bootstrap followed that of Chapter 8, where we
motivated the construction by using the bootstrap distribution of $\theta^* - \hat\theta$ to approximate
the distribution of $\hat\theta - \theta_0$. We note that another popular method, known as the bootstrap
percentile method, gives the interval $(\underline{\theta}, \bar\theta)$ (see Example A of Section 8.5.3 for
the definition of the notation). The rationale for this is harder to understand. More accurate
methods for constructing bootstrap confidence intervals have been proposed and are 
under study, but we will not pursue these developments. 


Problems 


1. Plot the ecdf of this batch of numbers: 1, 14, 10, 9, 11, 9. 


2. Suppose that $X_1, X_2, \ldots, X_n$ are independent U[0, 1] random variables.


a. Sketch F(x) and the standard deviation of $F_n(x)$.
b. Generate many samples of size 16 on a computer; for each sample, plot $F_n(x)$
and $F_n(x) - F(x)$. Relate what you see to your answer to (a).


3. From Figure 10.1, roughly what are the upper and lower quartiles and the median 
of the distribution of melting points? 


4. In Section 10.2.1, it was claimed that the random variables $I_{(-\infty, x]}(X_i)$ are
independent. Why is this so?


5. Let $X_1, \ldots, X_n$ be a sample (i.i.d.) from a distribution function, F, and let $F_n$
denote the ecdf. Show that

$$\mathrm{Cov}[F_n(u), F_n(v)] = \frac{1}{n}\left[F(m) - F(u)F(v)\right]$$

where m = min(u, v). Conclude that $F_n(u)$ and $F_n(v)$ are positively correlated:
If $F_n(u)$ overshoots F(u), then $F_n(v)$ will tend to overshoot F(v).


6. Various chemical tests were conducted on beeswax by White, Riethof, and 
Kushnir (1960). In particular, the percentage of hydrocarbons in each sample 
of wax was determined. 


a. Plot the ecdf, a histogram, and a normal probability plot of the percentages of 
hydrocarbons given in the following table. Find the .90, .75, .50, .25, and .10 
quantiles. Does the distribution appear Gaussian? 


14.27 14.80 12.28 17.09 15.10 12.92 15.56 15.38 
15.15 13.98 14.90 15.91 14.52 15.63 13.83 13.66 
13.98 14.47 14.65 14.73 15.18 14.49 14.56 15.03 
15.40 14.68 13.33 14.41 14.19 15.21 14.75 14.41 
14.04 13.68 15.31 14.32 13.64 14.77 14.30 14.62 
14.10 15.47 13.73 13.65 15.02 14.01 14.92 15.47 
13.75 14.87 15.28 14.43 13.96 14.57 15.49 15.13
14.23 14.44 14.57 


b. The average percentage of hydrocarbons in microcrystalline wax (a synthetic
commercial wax) is 85%. Suppose that beeswax was diluted with 1% microcrystalline
wax. Could this be detected? What about a 3% or a 5% dilution?
(Such questions were one of the main concerns of the beeswax study.)


7. Compare group I to group V in Figure 10.2. Roughly, what are the differences in
lifetimes for the animals that are the 10% weakest, median, and 10% strongest?


8. Consider a sample of size 100 from an exponential distribution with parameter
$\lambda = 1$.

a. Sketch the approximate standard deviation of the empirical log survival function,
$\log S_n(t)$, as a function of t.

b. Generate several such samples of size 100 on a computer and for each sample
plot the empirical log survival function. Relate the plots to your answer to (a).


9. Use the method of propagation of error to derive an approximation to the bias of
the log survival function. Where is this bias large, and what is its sign?


10. Let $X_1, \ldots, X_n$ be a sample from cdf F and denote the order statistics by
$X_{(1)}, X_{(2)}, \ldots, X_{(n)}$. We will assume that F is continuous, with density function
f. From Theorem A in Section 3.7, the density function of $X_{(k)}$ is

$$f_k(x) = n \binom{n-1}{k-1} [F(x)]^{k-1} [1 - F(x)]^{n-k} f(x)$$

a. Find the mean and variance of $X_{(k)}$ from a uniform distribution on [0, 1]. You
will need to use the fact that the density of $X_{(k)}$ integrates to 1. Show that

$$\text{Mean} = \frac{k}{n+1}$$

$$\text{Variance} = \frac{1}{n+2} \left(\frac{k}{n+1}\right) \left(1 - \frac{k}{n+1}\right)$$

b. Find the approximate mean and variance of $Y_{(k)}$, the kth-order statistic of a
sample of size n from F. To do this, let

$$X_i = F(Y_i)$$

or

$$Y_i = F^{-1}(X_i)$$

The $X_i$ are a sample from a U[0, 1] distribution (why?). Use the propagation
of error formula,

$$Y_{(k)} = F^{-1}(X_{(k)}) \approx F^{-1}\left(\frac{k}{n+1}\right) + \left(X_{(k)} - \frac{k}{n+1}\right) \left.\frac{d}{dx} F^{-1}(x)\right|_{x = k/(n+1)}$$

and argue that

$$E(Y_{(k)}) \approx F^{-1}\left(\frac{k}{n+1}\right)$$

$$\mathrm{Var}(Y_{(k)}) \approx \frac{k}{n+1} \left(1 - \frac{k}{n+1}\right) \frac{1}{n+2} \cdot \frac{1}{\left(f\{F^{-1}[k/(n+1)]\}\right)^2}$$

c. Use the results of parts (a) and (b) to show that the variance of the pth sample
quantile is approximately

$$\frac{1}{n f^2(x_p)}\, p(1 - p)$$

where $x_p$ is the pth quantile.

d. Use the result of part (c) to find the approximate variance of the median of a
sample of size n from a $N(\mu, \sigma^2)$ distribution. Compare to the variance of the
sample mean.

11. Calculate the hazard function for

$$F(t) = 1 - e^{-\alpha t^{\beta}}, \quad t \ge 0$$
12. Let f denote the density function and h the hazard function of a nonnegative
random variable. Show that

$$f(t) = h(t)\, e^{-\int_0^t h(s)\,ds}$$

that is, that the hazard function uniquely determines the density.


13. Give an example of a probability distribution with increasing failure rate.


14. Give an example of a probability distribution with decreasing failure rate.


15. A prisoner is told that he will be released at a time chosen uniformly at random
within the next 24 hours. Let T denote the time that he is released. What is the
hazard function for T? For what values of t is it smallest and largest? If he has
been waiting for 5 hours, is it more likely that he will be released in the next few
minutes than if he has been waiting for 1 hour?


16. Suppose that F is N(0, 1) and G is N(1, 1). Sketch a Q-Q plot. Repeat for G
being N(1, 4).


17. Suppose that F is an exponential distribution with parameter $\lambda = 1$ and that
G is exponential with $\lambda = 2$. Sketch a Q-Q plot.


18. A certain chemotherapy treatment for cancer tends to lengthen the lifetimes of
very seriously ill patients and decrease the lifetimes of the least ill patients.
Suppose that an experiment is done that compares this treatment to a placebo.
Draw a sketch showing the qualitative behavior of a Q-Q plot.


19. Consider the two cdfs:

$$F(x) = x, \quad 0 \le x \le 1$$

$$G(x) = x^2, \quad 0 \le x \le 1$$

Sketch a Q-Q plot of F versus G.


20. Sketch what you would expect the qualitative shape of the hazard function of
human mortality to look like.


21. Make Q-Q plots for other pairs of treatment groups from Bjerkdahl's data (see
Example A in Section 10.2.2). Does the model of a multiplicative effect appear
reasonable?


22. By examining the survival function of group V of Bjerkdahl's data (see
Example A in Section 10.2.2), make a rough sketch of the qualitative shape of a
histogram. Then make a histogram, and compare it to your guess.


23. In the examples of Q-Q plots in the text, we only discussed the case in which
quantiles of equal size batches are compared. From two batches of size n the
k/(n + 1) quantiles are estimated as $X_{(k)}$ and $Y_{(k)}$, so one merely has to plot $X_{(k)}$
vs. $Y_{(k)}$. Write down a linear interpolation formula for the pth quantile where
$k/(n + 1) \le p < (k + 1)/(n + 1)$. Now suppose that the batch sizes are not the
same, being m and n, m < n say. A Q-Q plot may be constructed by fixing the
quantiles k/(m + 1) of the smaller data set and interpolating these quantiles for
the larger data set.

Interpolate to find the upper and lower quartiles of the following batch of
numbers: 1, 2, 3, 4, 5, 6.


24. Show that the probability plots discussed in Section 9.9 are Q-Q plots of the
empirical distribution $F_n$ versus a theoretical distribution F.


25. In Section 10.2.3, it was claimed that if $y_p = c x_p$, then G(y) = F(y/c). Justify
this claim.


26. Hampson and Walker also made measurements of the heats of sublimation of
rhodium and iridium. Do the following calculations for each of the two given
sets of data:


a. Make a histogram.

b. Make a stem-and-leaf plot.

c. Make a boxplot.

d. Plot the observations in the order of the experiment.

e. Does the statistical model of independent and identically distributed measurement
errors seem reasonable?

f. Find the mean, 10% and 20% trimmed means, and median and compare them.

g. Find the standard error of the sample mean and a corresponding approximate
90% confidence interval.

h. Find a confidence interval based on the median that has as close to 90%
coverage as possible.

i. Use the bootstrap to approximate the sampling distributions of the 10% and
20% trimmed means and their standard errors and compare.

j. Use the bootstrap to approximate the sampling distribution of the median and
its standard error. Compare to the corresponding results for trimmed means
above.

k. Find approximate 90% confidence intervals based on the trimmed means and
compare to the intervals for the mean and median found previously.





Iridium (kcal/mol)

136.6 145.2 151.5 162.7 159.1 159.8 160.8 173.9 160.1
160.4 161.1 160.6 160.2 159.5 160.3 159.2 159.3 159.6
160.0 160.2 160.1 160.0 159.7 159.5 159.5 159.6 159.5


Rhodium (kcal/mol)

126.4 135.7 132.9 131.5 131.1 131.1 131.9 132.7
133.3 132.5 133.0 133.0 132.4 131.6 132.6 132.2
131.3 131.2 132.1 131.1 131.4 131.2 131.1 131.1
134.2 133.8 133.3 133.5 133.4 133.5 133.0 132.8
132.6 133.3 133.5 133.5 132.3 132.7 132.9 134.1





27. Demographers often refer to the hazard function as the “age specific mortality
rate,” or death rate. Until recently, most researchers in the field of gerontology
thought that a death rate increasing with age was a universal fact in the biological
world. There has been heavy debate over whether there is a genetically programmed
upper limit to lifespan. Using a facility in which sterilized medflies are
bred to be released to fight medfly infestations in California, James Carey and
coworkers (Carey et al. 1992) bred more than a million medflies and recorded their
pattern of mortality. The data file medflies contains the number of medflies
alive from an initial population of 1,203,646 as a function of age in days. Using
these data, estimate and plot the age specific mortality rate. Does it increase with
age?


28. For a sample of size n = 3 from a continuous probability distribution, what
is $P(X_{(1)} < \eta < X_{(2)})$, where $\eta$ is the median of the distribution? What is
$P(X_{(1)} < \eta < X_{(3)})$?


29. Of the 26 measurements of the heat of sublimation of platinum, 5 are outliers
(see Figure 10.10). Let N denote the number of these outliers that occur in a
bootstrap sample (sample with replacement) of the 26 measurements.


a. Explain why the distribution of N is binomial.

b. Find $P(N \ge 10)$.

c. In 1000 bootstrap samples, how many would you expect to contain 10 or more
of these outliers?

d. What is the probability that a bootstrap sample is composed entirely of these
outliers?


30. In Example A of Section 10.4.6, a 90% bootstrap confidence interval based on
the trimmed mean was found to be (133.65, 135.58). Compare these values to the
list of data values given in Section 10.4.1 and observe that 133.65 is smaller than
the smallest observation. Explain why the bootstrap confidence interval extends
so far in this direction.


31. We have seen that the bootstrap entails sampling with replacement from the
original observations.

a. If the original sample is of size n, how many samples with replacement are
there?

b. Suppose for pedagogical purposes that n = 3 and we have the following
observations: 1, 3, 4. List all the possible samples with replacement.

c. Now suppose that we want to find the bootstrap distribution of the sample
mean. For each of the preceding samples, calculate the mean and use these
results to construct the bootstrap distribution of the sample mean.

d. Based on the bootstrap distribution, what is the standard error of the sample
mean? Compare this to the usual estimated standard error, $s_{\bar{X}}$.


32. Explain how the bootstrap could be used to approximate the sampling distribution
of the MAD.


33. Which of the following statistics can be made arbitrarily large by making one
number out of a batch of 100 numbers arbitrarily large: the mean, the median, the
10% trimmed mean, the standard deviation, the MAD, the interquartile range?


34. Show that the median is an M estimate if $\Psi(x) = |x|$. For what symmetric density
function is this the mle of the mean?


35. What proportion of the observations from a normal sample would you expect to
be marked by an asterisk on a boxplot?


36. Explain why the IQR and the MAD are divided by 1.35 and .675, respectively,
to estimate $\sigma$ for a normal sample.


37. For the data of Problem 6:

a. Find the mean, median, and 10% and 20% trimmed means.

b. Find an approximate 90% confidence interval for the mean.

c. Find a confidence interval with coverage near 90% for the median.

d. Use the bootstrap to find approximate standard errors of the trimmed means.

e. Use the bootstrap to find approximate 90% confidence intervals for the trimmed
means.

f. Find and compare the standard deviation of the measurements, the interquartile
range, and the MAD.

g. Use the bootstrap to find the approximate sampling distribution and standard
error of the upper quartile.


38. The Cauchy distribution has the density function

$$f(x) = \frac{1}{\pi} \left(\frac{1}{1 + x^2}\right), \quad -\infty < x < \infty$$

which is symmetric about zero. This distribution has very heavy tails, which
cause the arithmetic mean to be a very poor estimate of location. Simulate the
distribution of the arithmetic mean and of the median from a sample of size 25
from the Cauchy distribution by drawing 100 samples of size 25 and compare.
From Example B in Section 3.6.1, if $Z_1$ and $Z_2$ are independent and N(0, 1),
then their quotient follows a Cauchy distribution. (This gives a simple way of
generating Cauchy random variables.)


39. Simiu and Filliben (1975), in a statistical analysis of extreme winds, analyzed the
data contained in the file windspeed 10.1. Construct boxplots to examine 
and compare the forms of the distributions across cities and across years. 


40. Olson, Simpson, and Eden (1975) discuss the analysis of data obtained from a
cloud seeding experiment. A cloud was deemed “seedable” if it satisfied certain
criteria; for each seedable cloud a decision was made at random whether to actually
seed. The nonseeded clouds are referred to as control clouds. The following
table presents the rainfall from 26 seeded and 26 control clouds. Make Q-Q plots
for rainfall versus rainfall and log rainfall versus log rainfall. What do these plots
suggest about the effect, if any, of seeding?














Seeded Clouds 
129.6 31.4 2745.6 489.1 430.0 302.8 119.0 4.1 
92.4 17.5 200.7 274.7 274.7 7.7 1656.0 978.0 
198.6 703.4 1697.8 334.1 118.3 255.0 115.3 242.5 
32.7 40.6 
Control Clouds 
26.1 26.3 87.0 95.0 372.4 0.01 17.3 24.4 
11.5 321.2 68.5 81.2 47.3 28.6 830.1 345.5 
1202.6 36.6 4.9 4.9 41.1 29.0 163.0 244.3 
147.8 21.7 





Based on your results, how would you expect boxplots of precipitation from 
seeded and unseeded clouds to compare? How would you expect boxplots of log 
precipitation to compare? Make the boxplots and see whether your predictions 
are confirmed. 


41. Construct a nonparametric confidence interval for a quantile $x_p$ by using the same
reasoning as in the derivation of a confidence interval for a median. 


42. In a study of the natural variability of rainfall, the rainfall of summer storms
was measured by a network of rain gauges in southern Illinois for the years
1960-1964 (Changnon and Huff, in LeCam and Neyman 1967). The average
amount of rainfall (in inches) from each storm, by year, is contained in the files
Illinois60, ..., Illinois64.








a. Is the form of the distribution of rainfall per storm skewed or symmetric? 

b. What is the average rainfall per storm? What is the median rainfall per storm? 
Explain why these measures differ, using the results of part (a). 

c. You may have read statements like “10% of the storms account for 90% of 
the rain." Construct a graph that shows such a relationship for these data. 

d. Compare the years using boxplots. 

e. Which years were wet and which were dry? Are the wet years wet because 
there were more storms, because individual storms produced more rain, or for 
both of these reasons? 


43. Barlow, Toland, and Freeman (1984) studied the lifetimes of Kevlar 49/epoxy 
strands subjected to sustained stress. (The space shuttle uses Kevlar/epoxy spherical 
vessels in an environment of sustained pressure.) The files kevlar70, 
kevlar80, and kevlar90 contain the times to failure (in hours) of strands 
tested at 70%, 80%, and 90% stress levels. What do these data indicate about the 
nature of the distribution of lifetimes and the effect of increasing stress? 


44. Hopper and Seeman (1994) studied the relationship between bone density and 
smoking among 41 pairs of middle-aged female twins. In each pair, one twin 
was a lighter smoker and one a heavier smoker, as measured by pack-years, the 
number of packages of cigarettes consumed in a year. Bone mineral density was 
measured at the lumbar spine, the femoral (hip) neck, and the femoral shaft. As 
well as smoking, other variables, such as alcohol consumption and tea and coffee 
consumption, were recorded. The data are contained in the file bonden and 
documentation is in the file bonedendoc. Use graphical methods to compare bone 
densities of the heavy and light smoking twins. Do any of the other variables bear 
a relationship to bone density? After completing your analysis, you may wish to 
compare your conclusions to those in the paper. 


45. The 2000 U.S. Presidential election was very close and hotly contested. George 
W. Bush was ultimately appointed to the Presidency by the U.S. Supreme Court. 
Among the issues was a confusing ballot in Palm Beach County, Florida, the 
so-called Butterfly Ballot, shown in the following figure. 





























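[Figure: the Palm Beach County "butterfly ballot." Candidates are listed in two 
facing columns with a single column of numbered punch holes between them. 
Left column: (Republican) George W. Bush, President, and Dick Cheney, Vice 
President, hole 3; (Democratic) Al Gore, President, and Joe Lieberman, Vice 
President, hole 5; (Libertarian) Harry Browne, President, and Art Olivier, Vice 
President. Right column: (Reform) Pat Buchanan, President, and Ezola Foster, 
Vice President, hole 4; (Socialist) David McReynolds, President, and Mary Cal 
Hollis, Vice President, hole 6.] 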

















Notice that on this ballot, although the Democrats are listed in the second row 
on the left, a voter wishing to specify them would have to punch the third 
hole—punching the second hole would result in a vote for the Reform Party 
(Pat Buchanan). After the election, many distraught Democratic voters claimed 
that they had inadvertently voted for Buchanan, a right-wing candidate. 

The file PalmBeach contains relevant data: vote counts by county in Florida 
for Buchanan and for four other presidential candidates in 2000, the total vote 
counts in 2000, the presidential vote counts for three presidential candidates in 
1996, the vote count for Buchanan in the 1996 Republican primary, the registration 
in Buchanan's Reform Party, and the total registration in the county. Does 
this data support voters' claims that they were misled by the form of the ballot? 
Start by making two scatterplots: a plot of Buchanan's votes versus Bush's votes 
in 2000, and a plot of Buchanan's votes in 2000 versus his votes in the 1996 
primary. 


46. The file bodytemp contains normal body temperature readings (degrees Fahrenheit) 
and heart rates (beats per minute) of 65 males (coded by 1) and 65 females 
(coded by 2) from Shoemaker (1996). 


a. For both males and females, make scatterplots of heart rate versus body 
temperature. Comment on the relationship or lack thereof. 

b. Quantify the strengths of the relationships by calculating Pearson and rank 
correlation coefficients. 

c. Does the relationship for males appear to be the same as that for females? 
Examine this question graphically, by making a scatterplot showing both females 
and males and identifying females and males by different plotting symbols. 


47. Old Faithful geyser in Yellowstone National Park, Wyoming, derives its name 
from the regularity of its eruptions. The file oldfaithful contains measurements 
on eight successive days of the durations of the eruptions (in minutes) and 
the subsequent time interval before the next eruption. 


a. Use histograms of durations and time intervals as well as other graphical 
methods to examine the fidelity of Old Faithful, and summarize your findings. 

b. Is there a relationship between the durations of eruptions and the time intervals 
between them? 


48. In 1970, Congress instituted a lottery for the military draft to support the unpopular 
war in Vietnam. All 366 possible birth dates were placed in plastic capsules 
in a rotating drum and were selected one by one. Eligible males born on the first 
day drawn were first in line to be drafted followed by those born on the second 
day drawn, etc. The results were criticized by some who claimed that government 
incompetence at running a fair lottery resulted in a tendency of men born later 
in the year being more likely to be drafted. Indeed, later investigation revealed 
that the birthdates were placed in the drum by month and were not thoroughly 
mixed. The columns of the file 1970lottery are month, month number, day 
of the year, and draft number. 


a. Plot draft number versus day number. Do you see any trend? 

b. Calculate the Pearson and rank correlation coefficients. What do they 
suggest? 

c. Is the correlation statistically significant? One way to assess this is via a 
permutation test. Randomly permute the draft numbers and find the corre- 
lation of this random permutation with the day numbers. Do this 100 times 
and see how many of the resulting correlation coefficients exceed the one 
observed in the data. If you are not satisfied with 100 times, do it 1,000 
times. 
d. Make parallel boxplots of the draft numbers by month. Do you see any 
pattern? 
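
The permutation procedure described in part (c) can be sketched as follows. This is only an illustration under stated assumptions: the lottery file is not read here, so the draft numbers are a synthetic stand-in produced by a fair shuffle, the function names are hypothetical, and the comparison uses absolute values of the correlation (a two-sided version of "exceed the one observed").

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Fraction of random permutations of y whose correlation with x is
    at least as large in absolute value as the observed correlation."""
    rng = random.Random(seed)
    observed = pearson(x, y)
    y_perm = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # random reassignment of draft numbers to days
        if abs(pearson(x, y_perm)) >= abs(observed):
            count += 1
    return observed, count / n_perm


# Synthetic stand-in for the lottery data (assumption: a fair shuffle).
days = list(range(1, 367))
draft = list(range(1, 367))
random.Random(1).shuffle(draft)

r, p = permutation_pvalue(days, draft)
print(f"observed r = {r:.3f}, permutation p-value = {p:.3f}")
```

With the real data, `days` and `draft` would be replaced by the day-of-year and draft-number columns of the file; a small p-value would indicate a correlation larger than random mixing typically produces.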


49. Olive oil from Spain, Tunisia, and other countries is imported into Italy and is 
then repackaged and exported with the label "Imported from Italy." Olive oils 
from different places have distinctive tastes. Can the oils from different regions 
and areas in Italy be distinguished based on their combinations of fatty acids? 
This question was considered by Forina et al. (1983). The data consist of the 
percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, 
linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive 
oils. There are 9 collection areas, 4 from southern Italy (North and South Apulia, 
Calabria, Sicily), 2 from Sardinia (Inland and Coastal), and 3 from northern 
Italy (Umbria, East and West Liguria). The file olive contains the following 
variables for each of the 572 samples: 


Region: South, North, or Sardinia 
Area (subregions within the larger regions): North and South Apulia, Calabria, 
Sicily, Inland and Coastal Sardinia, Umbria, East and West Liguria 
Palmitic Acid Percentage 
Palmitoleic Acid Percentage 
Stearic Acid Percentage 
Oleic Acid Percentage 
Linoleic Acid Percentage 
Linolenic Acid Percentage 
Arachidic Acid Percentage 
Eicosenoic Acid Percentage 


Examine this data with the aim of distinguishing between regions and areas 
by using fatty acid composition. 

a. Make a table of the mean and median values of percentages for each area, 
grouping the areas within regions. 

b. Complement the analysis by making parallel boxplots. Which variables look 
promising for separating the regions? 

c. It is possible that the regions can be more clearly separated by considering 
pairs of variables. Use the variables that appear to be informative from the 
analysis up to this point to make scatterplots. How well can the regions be 
separated based on the scatterplots? 

d. How well can the areas within regions be distinguished? 

e. By interactively rotating point clouds, one can examine relationships among 
more than two variables at a time. Try this with the software ggobi, available 
at http://www.ggobi.org/. 


50. The file flow-occ contains data collected by loop detectors at a particular 
location of eastbound Interstate 80 in Sacramento, California, from March 14-20, 
2003. (Source: http://pems.eecs.berkeley.edu/) For each of three 
lanes, the flow (the number of cars) and the occupancy (the percentage of time 
a car was over the loop) were recorded in successive five-minute intervals. (See 
Example B of Section 10.7 for background information.) There were 1740 such 
five-minute intervals. Lane 1 is the farthest left lane, lane 2 is in the center, and 
lane 3 is the farthest right. 


a. For each station, plot flow and occupancy versus time. Explain the patterns 
you see. Can you deduce from the plots what the days of the week were? 

b. Compare the flows in the three lanes by making parallel boxplots. Which lane 
typically serves the most traffic? 

c. Examine the relationships of the flows in the three lanes by making scatterplots. 
Can you explain the patterns you see? Are statements of the form, "The flow 
in lane 2 is typically about 50% higher than in lane 3," accurate descriptions 
of the relationships? 

d. Occupancy can be viewed as a measure of congestion. Find the mean and 
median occupancy in each of the three lanes. Do you think that the distributions 
of occupancy are symmetric or skewed? Why? 

e. Make histograms of the occupancies, varying the number of bins. What number 
of bins seems to give good representations for the shapes of the distributions? 
Are there any unusual features, and if so, how might they be explained? 

f. Make plots to support or refute the statement, "When one lane is congested, 
the others are, too." 

g. Flow can be regarded as a measure of the throughput of the system. How does 
this throughput depend on congestion? Consider the following conjecture: 
"When very few cars are on the road, flow is small and so is congestion. 
Adding a few more cars may increase congestion but not enough so that 
velocity is decreased, so flow will also increase. Beyond some point, increasing 
occupancy (congestion) will decrease velocity, but since there will then be 
more cars in total, flow will still continue to increase." Does this seem plausible 
to you? Plot flow versus occupancy for each of the three lanes. Does this 
conjecture appear to be true? Can you explain what you see? Is the relationship 
of flow to occupancy the same in all lanes? 

h. This and the following exercises require the use of dynamic graphics, e.g., 
http://www.ggobi.org/. Make time series plots of all the variables. 
Consider lane 1. Make a one-dimensional display of occupancy and vary the 
smoothness until you can see some distinct modes. Use brushing to determine 
when in the time series plots those modes occurred. Do the same for flow and 
then examine some other lanes. 

i. Choose a lane and make one-dimensional displays for flow and occupancy 
and a scatterplot of flow versus occupancy. Use brushing to simultaneously 
identify regions in the three plots. Does what you see make sense? 

j. From scatterplots of flow versus occupancy, examine when different regions 
of this scatterplot occur in time. In particular, identify when in the time series 
plots the flow breaks down because a critical point is reached. 

k. You have now seen that all these variables, flow and occupancy in each of the 
three lanes, are closely related, but because scatterplots are two-dimensional, 
you have been able to examine only those relationships between pairs of 
variables. In these scatterplots, the points tend to lie along curves. What happens 
in higher dimensions? 

i. Examine the relationship of the three flows. In three dimensions, do the 
points tend to lie along a curve (a one-dimensional object), or do they tend 
to concentrate on a two-dimensional manifold, or are they scattered over 
three dimensions? 

ii. Examine the relationships of the three occupancies. In three dimensions, 
do the points tend to lie along a curve (a one-dimensional object), or 
do they tend to concentrate on a two-dimensional manifold, or are they 
scattered over three dimensions? 

iii. How do the points lie in six dimensions (three flows and three occupan- 
cies)? When do different regions occur in time? 


l. A taxi driver claims that when traffic breaks down, the fast lane breaks down 
first so he moves immediately to the right lane. Can you see any such phe- 
nomena in the data? 
CHAPTER 11 


Comparing Two Samples 


Introduction 


This chapter is concerned with methods for comparing samples from distributions 
that may be different and especially with methods for making inferences about how 
the distributions differ. In many applications, the samples are drawn under different 
conditions, and inferences must be made about possible effects of these conditions. 
We will be primarily concerned with effects that tend to increase or decrease the 
average level of response. 

For example, in the end-of-chapter problems, we will consider some experiments 
performed to determine to what degree, if any, cloud seeding increases precipitation. 
In cloud-seeding experiments, some storms are selected for seeding, other storms are 
left unseeded, and the amount of precipitation from each storm is measured. This 
amount varies widely from storm to storm, and in the face of this natural variability, 
it is difficult to tell whether seeding has a systematic effect. The average precipitation 
from the seeded storms might be slightly higher than that from the unseeded storms, 
but a skeptic might not be convinced that the difference was due to anything but 
chance. We will develop statistical methods to deal with this type of problem based 
on a stochastic model that treats the amounts of precipitation as random variables. 
We will also see how a process of randomization allows us to make inferences about 
treatment effects even in the case where the observations are not modeled as samples 
from populations or probability laws. 

This chapter will be concerned with analyzing measurements that are continuous 
in nature (such as temperature); Chapter 13 will take up the analysis of qualitative 
data. This chapter will conclude with some general discussion of the design and 
interpretation of experimental studies. 
Comparing Two Independent Samples 


In many experiments, the two samples may be regarded as being independent of each 
other. In a medical study, for example, a sample of subjects may be assigned to a 
particular treatment, and another independent sample may be assigned to a control 
(or placebo) treatment. This is often accomplished by randomly assigning individuals 
to the placebo and treatment groups. In later sections, we will discuss methods that 
are appropriate when there is some pairing, or dependence, between the samples, such 
as might occur if each person receiving the treatment were paired with an individual 
of similar weight in the control group. 

Many experiments are such that if they were repeated, the measurements would 
not be exactly the same. To deal with this problem, a statistical model is often em- 
ployed: The observations from the control group are modeled as independent random 
variables with a common distribution, F, and the observations from the treatment 
group are modeled as being independent of each other and of the controls and as 
having their own common distribution function, G. Analyzing the data thus entails 
making inferences about the comparison of F and G. In many experiments, the pri- 
mary effect of the treatment is to change the overall level of the responses, so that 
analysis focuses on the difference of means or other location parameters of F and G. 
When only a small amount of data is available, it may not be practical to do much 
more than this. 


Methods Based on the Normal Distribution 


In this section, we will assume that a sample, $X_1, \ldots, X_n$, is drawn from a 
normal distribution that has mean $\mu_X$ and variance $\sigma^2$, and that an independent sample, 
$Y_1, \ldots, Y_m$, is drawn from another normal distribution that has mean $\mu_Y$ and the same 
variance, $\sigma^2$. If we think of the X's as having received a treatment and the Y's as 
being the control group, the effect of the treatment is characterized by the difference 
$\mu_X - \mu_Y$. A natural estimate of $\mu_X - \mu_Y$ is $\bar{X} - \bar{Y}$; in fact, this is the mle. Since 
$\bar{X} - \bar{Y}$ may be expressed as a linear combination of independent normally distributed 
random variables, it is normally distributed: 

$$\bar{X} - \bar{Y} \sim N\left[\mu_X - \mu_Y,\ \sigma^2\left(\frac{1}{n} + \frac{1}{m}\right)\right]$$

If $\sigma^2$ were known, a confidence interval for $\mu_X - \mu_Y$ could be based on 

$$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$$

which follows a standard normal distribution. The confidence interval would be of 
the form 

$$(\bar{X} - \bar{Y}) \pm z(\alpha/2)\,\sigma\sqrt{\frac{1}{n} + \frac{1}{m}}$$

This confidence interval is of the same form as those introduced in Chapters 7 and 
8: a statistic ($\bar{X} - \bar{Y}$ in this case) plus or minus a multiple of its standard deviation. 
Generally, $\sigma^2$ will not be known and must be estimated from the data by 
calculating the pooled sample variance, 

$$s_p^2 = \frac{(n-1)s_X^2 + (m-1)s_Y^2}{m+n-2}$$

where $s_X^2 = (n-1)^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ and similarly for $s_Y^2$. Note that $s_p^2$ is a weighted 
average of the sample variances of the X's and Y's, with the weights proportional 
to the degrees of freedom. This weighting is appropriate since if one sample is 
much larger than the other, the estimate of $\sigma^2$ from that sample is more reliable 
and should receive greater weight. The following theorem gives the distribution of a 
statistic that will be used for forming confidence intervals and performing hypothesis 
tests. 


THEOREM A 
Suppose that $X_1, \ldots, X_n$ are independent and normally distributed random 
variables with mean $\mu_X$ and variance $\sigma^2$, and that $Y_1, \ldots, Y_m$ are independent and 
normally distributed random variables with mean $\mu_Y$ and variance $\sigma^2$, and that 
the $Y_i$ are independent of the $X_i$. The statistic 

$$t = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{s_p\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$$

follows a t distribution with m + n − 2 degrees of freedom. 


Proof 

According to the definition of the t distribution in Section 6.2, we have to 
show that the statistic is the quotient of a standard normal random variable and 
the square root of an independent chi-square random variable divided by its 
n + m − 2 degrees of freedom. First, we note from Theorem B in Section 6.3 that 
$(n-1)s_X^2/\sigma^2$ and $(m-1)s_Y^2/\sigma^2$ are distributed as chi-square random variables 
with n − 1 and m − 1 degrees of freedom, respectively, and are independent since 
the $X_i$ and $Y_i$ are. Their sum is thus chi-square with m + n − 2 df. Now, we 
express the statistic as the ratio U / V, where 

$$U = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$$

$$V = \sqrt{\left[\frac{(n-1)s_X^2}{\sigma^2} + \frac{(m-1)s_Y^2}{\sigma^2}\right]\frac{1}{m+n-2}}$$

U follows a standard normal distribution and from the preceding argument V has 
the distribution of the square root of a chi-square random variable divided by its 
degrees of freedom. The independence of U and V follows from Corollary A in 
Section 6.3. ■ 
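
Theorem A can be checked informally by simulation. The sketch below is only an illustration: the sample sizes, means, common standard deviation, and number of replications are arbitrary choices, and the critical value 2.093 is the .975 quantile of the t distribution with 19 df. If the theorem holds, the pooled statistic should exceed that value in absolute terms in roughly 5% of simulated experiments.

```python
import random
from statistics import mean, variance


def pooled_t(x, y, mu_x, mu_y):
    """Statistic of Theorem A: centered difference of means over s_p*sqrt(1/n + 1/m)."""
    n, m = len(x), len(y)
    sp2 = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (m + n - 2)
    se = (sp2 * (1 / n + 1 / m)) ** 0.5
    return ((mean(x) - mean(y)) - (mu_x - mu_y)) / se


def tail_fraction(n=13, m=8, crit=2.093, reps=5000, seed=0):
    """Fraction of simulated statistics with |t| > crit (illustrative settings)."""
    rng = random.Random(seed)
    mu_x, mu_y, sigma = 1.0, -0.5, 2.0  # arbitrary means, common sigma
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(mu_x, sigma) for _ in range(n)]
        y = [rng.gauss(mu_y, sigma) for _ in range(m)]
        if abs(pooled_t(x, y, mu_x, mu_y)) > crit:
            hits += 1
    return hits / reps


frac = tail_fraction()
print(f"fraction with |t| > 2.093: {frac:.3f} (Theorem A suggests about 0.05)")
```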
It is convenient and suggestive to use the following notation for the estimated 
standard deviation (or standard error) of $\bar{X} - \bar{Y}$: 

$$s_{\bar{X} - \bar{Y}} = s_p\sqrt{\frac{1}{n} + \frac{1}{m}}$$

A confidence interval for $\mu_X - \mu_Y$ follows as a corollary to Theorem A. 


COROLLARY A 

Under the assumptions of Theorem A, a 100(1 − α)% confidence interval for 
$\mu_X - \mu_Y$ is 

$$(\bar{X} - \bar{Y}) \pm t_{m+n-2}(\alpha/2)\, s_{\bar{X} - \bar{Y}}$$ ■ 


EXAMPLE A 

Two methods, A and B, were used in a determination of the latent heat of fusion of 
ice (Natrella 1963). The investigators wanted to find out by how much the methods 
differed. The following table gives the change in total heat from ice at −.72°C to 
water at 0°C in calories per gram of mass: 

Method A    Method B 
79.98       80.02 
80.04       79.94 
80.02       79.98 
80.04       79.97 
80.03       79.97 
80.03       80.03 
80.04       79.95 
79.97       79.97 
80.05 
80.03 
80.02 
80.00 
80.02 

It is fairly obvious from the table and from boxplots (Figure 11.1) that there is a 
difference between the two methods (we will test this more formally later). If we 
assume the conditions of Theorem A, we can form a 95% confidence interval to 
estimate the magnitude of the average difference between the two methods. From the 
table, we calculate 

$$\bar{X}_A = 80.02 \qquad s_A = .024$$
$$\bar{X}_B = 79.98 \qquad s_B = .031$$
$$s_p^2 = \frac{12 \times s_A^2 + 7 \times s_B^2}{19} = .0007178$$
$$s_p = .027$$
FIGURE 11.1 Boxplots of measurements of heat of fusion obtained by methods A 
and B. 


Our estimate of the average difference of the two methods is $\bar{X}_A - \bar{X}_B = .04$ and its 
estimated standard error is 

$$s_{\bar{X}_A - \bar{X}_B} = s_p\sqrt{\frac{1}{13} + \frac{1}{8}} = .012$$

From Table 4 of Appendix B, the .975 quantile of the t distribution with 19 df 
is 2.093, so $t_{19}(.025) = 2.093$ and the 95% confidence interval is 
$(\bar{X}_A - \bar{X}_B) \pm t_{19}(.025)\, s_{\bar{X}_A - \bar{X}_B}$, or (.015, .065). The estimated standard error and the confidence 
interval quantify the uncertainty in the point estimate $\bar{X}_A - \bar{X}_B = .04$. ■ 
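
The calculation in Example A can be reproduced from the raw measurements. The following is a minimal sketch, not a definitive implementation: the function name is hypothetical, and the quantile 2.093 is supplied by hand from the t table rather than computed. Because the code carries full precision instead of the rounded summary values used in the hand calculation, it gives a difference near .042 and an interval near (.017, .067) rather than (.015, .065).

```python
from statistics import mean, variance

method_a = [79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04,
            79.97, 80.05, 80.03, 80.02, 80.00, 80.02]
method_b = [80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97]


def pooled_ci(x, y, t_quantile):
    """Pooled-variance confidence interval for the difference of means.

    t_quantile is t_{m+n-2}(alpha/2), looked up in a table by the caller.
    """
    n, m = len(x), len(y)
    sp2 = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (m + n - 2)
    se = (sp2 * (1 / n + 1 / m)) ** 0.5  # estimated standard error
    diff = mean(x) - mean(y)
    return diff, se, (diff - t_quantile * se, diff + t_quantile * se)


diff, se, (lo, hi) = pooled_ci(method_a, method_b, t_quantile=2.093)
print(f"difference = {diff:.4f}, standard error = {se:.4f}")
print(f"95% confidence interval: ({lo:.4f}, {hi:.4f})")
```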





We will now discuss hypothesis testing for the two-sample problem. Although 
the hypotheses under consideration are different from those of Chapter 9, the general 
conceptual framework is the same (you should review that framework at this time). 
In the current case, the null hypothesis to be tested is 


$$H_0\colon \mu_X = \mu_Y$$


This asserts that there is no difference between the distributions of the X's and Y’s. If 
one group is a treatment group and the other a control, for example, this hypothesis 
asserts that there is no treatment effect. In order to conclude that there is a treatment 
effect, the null hypothesis must be rejected. 

There are three common alternative hypotheses for the two-sample case: 


$$H_1\colon \mu_X \neq \mu_Y$$
$$H_2\colon \mu_X > \mu_Y$$
$$H_3\colon \mu_X < \mu_Y$$

The first of these is a two-sided alternative, and the other two are one-sided 
alternatives. The first hypothesis is appropriate if deviations could in principle go 
in either direction, and one of the latter two is appropriate if it is believed that any 
deviation must be in one direction or the other. In practice, such a priori information 
is not usually available, and it is more prudent to conduct two-sided tests, as in 
Example A. 

The test statistic that will be used to make a decision whether or not to reject the 
null hypothesis is 

$$t = \frac{\bar{X} - \bar{Y}}{s_{\bar{X} - \bar{Y}}}$$

The t statistic equals the multiple of its estimated standard deviation that $\bar{X} - \bar{Y}$ 
differs from zero. It plays the same role in the comparison of two samples as is 
played by the chi-square statistic in testing goodness of fit. Just as we rejected for 
large values of the chi-square statistic, we will reject in this case for extreme values 
of t. The distribution of t under $H_0$, its null distribution, is, from Theorem A, the t 
distribution with m + n − 2 degrees of freedom. Knowing this null distribution allows 
us to determine a rejection region for a test at level α, just as knowing that the null 
distribution of the chi-square statistic was chi-square with the appropriate degrees of 
freedom allowed the determination of a rejection region for testing goodness of fit. 
The rejection regions for the three alternatives just listed are 

For $H_1$, $|t| > t_{n+m-2}(\alpha/2)$ 

For $H_2$, $t > t_{n+m-2}(\alpha)$ 

For $H_3$, $t < -t_{n+m-2}(\alpha)$ 

Note how the rejection regions are tailored to the particular alternatives and how 
knowing the null distribution of t allows us to determine the rejection region for any 
value of α. 


EXAMPLE B 

Let us continue Example A. To test $H_0\colon \mu_A = \mu_B$ versus a two-sided alternative, we 
form and calculate the following test statistic: 

$$t = \frac{\bar{X}_A - \bar{X}_B}{s_{\bar{X}_A - \bar{X}_B}} = 3.33$$

From Table 4 in Appendix B, $t_{19}(.005) = 2.861 < 3.33$. The two-sided test would 
thus reject at the level α = .01. If there were no difference in the two conditions, 
differences as large or larger than that observed would occur only with probability 
less than .01; that is, the p-value is less than .01. There is little doubt that there is a 
difference between the two methods. ■ 
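
The three rejection regions can be written as a small decision rule. This is a sketch under stated assumptions: the function names and the alternative labels ("two-sided", "greater", "less") are hypothetical, and the critical value is supplied by the caller from a t table ($t_{n+m-2}(\alpha/2)$ for the two-sided alternative, $t_{n+m-2}(\alpha)$ for the one-sided ones). Applied to the unrounded heat-of-fusion data, the statistic is about 3.47 rather than the hand-rounded 3.33; the conclusion at level .01 is the same.

```python
from statistics import mean, variance

method_a = [79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04,
            79.97, 80.05, 80.03, 80.02, 80.00, 80.02]
method_b = [80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97]


def t_statistic(x, y):
    """Pooled two-sample t statistic for H0: mu_X = mu_Y."""
    n, m = len(x), len(y)
    sp2 = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (m + n - 2)
    return (mean(x) - mean(y)) / (sp2 * (1 / n + 1 / m)) ** 0.5


def rejects(t, crit, alternative="two-sided"):
    """Apply the rejection region for the chosen alternative.

    crit must be the table value matching the alternative and level.
    """
    if alternative == "two-sided":  # H1: mu_X != mu_Y
        return abs(t) > crit
    if alternative == "greater":    # H2: mu_X > mu_Y
        return t > crit
    if alternative == "less":       # H3: mu_X < mu_Y
        return t < -crit
    raise ValueError(f"unknown alternative: {alternative}")


t = t_statistic(method_a, method_b)
print(f"t = {t:.2f}; reject at level .01 (two-sided): {rejects(t, 2.861)}")
```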


In Chapter 9, we developed a general duality between hypothesis tests and confi- 
dence intervals. In the case of the testing and confidence interval methods considered 
in this section, the t test rejects if and only if the confidence interval does not include 
zero (see Problem 10 at the end of this chapter). 

We will now demonstrate that the test of $H_0$ versus $H_1$ is equivalent to a likelihood 
ratio test. (The rather long argument is sketched here and should be read with paper 
and pencil in hand.) $\Omega$ is the set of all possible parameter values: 

$$\Omega = \{-\infty < \mu_X < \infty,\ -\infty < \mu_Y < \infty,\ 0 < \sigma < \infty\}$$

The unknown parameters are $\theta = (\mu_X, \mu_Y, \sigma)$. Under $H_0$, $\theta \in \omega_0$, where 
$\omega_0 = \{\mu_X = \mu_Y,\ 0 < \sigma < \infty\}$. The likelihood of the two samples $X_1, \ldots, X_n$ and 
$Y_1, \ldots, Y_m$ is 

$$\mathrm{lik}(\mu_X, \mu_Y, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(1/2)[(X_i - \mu_X)^2/\sigma^2]} \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(1/2)[(Y_j - \mu_Y)^2/\sigma^2]}$$

and the log likelihood is 

$$l(\mu_X, \mu_Y, \sigma^2) = -\frac{m+n}{2}\log 2\pi - \frac{m+n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(X_i - \mu_X)^2 + \sum_{j=1}^{m}(Y_j - \mu_Y)^2\right]$$

We must maximize the likelihood under $\omega_0$ and under $\Omega$ and then calculate the ratio 
of the two maximized likelihoods, or the difference of their logarithms. 
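Under $\omega_0$, we have a sample of size m + n from a normal distribution with 
unknown mean $\mu_0$ and unknown variance $\sigma_0^2$. The mle's of $\mu_0$ and $\sigma_0^2$ are thus 

$$\hat{\mu}_0 = \frac{1}{m+n}\left(\sum_{i=1}^{n} X_i + \sum_{j=1}^{m} Y_j\right)$$

$$\hat{\sigma}_0^2 = \frac{1}{m+n}\left[\sum_{i=1}^{n}(X_i - \hat{\mu}_0)^2 + \sum_{j=1}^{m}(Y_j - \hat{\mu}_0)^2\right]$$

The corresponding value of the maximized log likelihood is, after some 
cancellation, 

$$l(\hat{\mu}_0, \hat{\sigma}_0^2) = -\frac{m+n}{2}\log 2\pi - \frac{m+n}{2}\log \hat{\sigma}_0^2 - \frac{m+n}{2}$$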








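To find the mle's $\hat{\mu}_X$, $\hat{\mu}_Y$, and $\hat{\sigma}_1^2$ under $\Omega$, we first differentiate the log likelihood and 
obtain the equations 

$$\sum_{i=1}^{n}(X_i - \hat{\mu}_X) = 0$$

$$\sum_{j=1}^{m}(Y_j - \hat{\mu}_Y) = 0$$

$$-\frac{m+n}{2\hat{\sigma}_1^2} + \frac{1}{2\hat{\sigma}_1^4}\left[\sum_{i=1}^{n}(X_i - \hat{\mu}_X)^2 + \sum_{j=1}^{m}(Y_j - \hat{\mu}_Y)^2\right] = 0$$

The mle's are, therefore, 

$$\hat{\mu}_X = \bar{X}$$

$$\hat{\mu}_Y = \bar{Y}$$

$$\hat{\sigma}_1^2 = \frac{1}{m+n}\left[\sum_{i=1}^{n}(X_i - \hat{\mu}_X)^2 + \sum_{j=1}^{m}(Y_j - \hat{\mu}_Y)^2\right]$$

When these are substituted into the log likelihood, we obtain 

$$l(\hat{\mu}_X, \hat{\mu}_Y, \hat{\sigma}_1^2) = -\frac{m+n}{2}\log 2\pi - \frac{m+n}{2}\log \hat{\sigma}_1^2 - \frac{m+n}{2}$$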
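The log of the likelihood ratio is thus 

$$\frac{m+n}{2}\log\left(\frac{\hat{\sigma}_1^2}{\hat{\sigma}_0^2}\right)$$

and the likelihood ratio test rejects for large values of 

$$\frac{\hat{\sigma}_0^2}{\hat{\sigma}_1^2} = \frac{\displaystyle\sum_{i=1}^{n}(X_i - \hat{\mu}_0)^2 + \sum_{j=1}^{m}(Y_j - \hat{\mu}_0)^2}{\displaystyle\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{j=1}^{m}(Y_j - \bar{Y})^2}$$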


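We now find an alternative expression for the numerator of this ratio, by using 
the identities 

$$\sum_{i=1}^{n}(X_i - \hat{\mu}_0)^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 + n(\bar{X} - \hat{\mu}_0)^2$$

$$\sum_{j=1}^{m}(Y_j - \hat{\mu}_0)^2 = \sum_{j=1}^{m}(Y_j - \bar{Y})^2 + m(\bar{Y} - \hat{\mu}_0)^2$$

We obtain 

$$\hat{\mu}_0 = \frac{1}{m+n}(n\bar{X} + m\bar{Y}) = \frac{n}{m+n}\bar{X} + \frac{m}{m+n}\bar{Y}$$

Therefore, 

$$\bar{X} - \hat{\mu}_0 = \frac{m(\bar{X} - \bar{Y})}{m+n}$$

$$\bar{Y} - \hat{\mu}_0 = \frac{n(\bar{Y} - \bar{X})}{m+n}$$

The alternative expression for the numerator of the ratio is thus 

$$\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{j=1}^{m}(Y_j - \bar{Y})^2 + \frac{mn}{m+n}(\bar{X} - \bar{Y})^2$$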
and the test rejects for large values of 

$$1 + \frac{mn}{m+n}\left(\frac{(\bar{X} - \bar{Y})^2}{\displaystyle\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{j=1}^{m}(Y_j - \bar{Y})^2}\right)$$

or, equivalently, for large values of 

$$\frac{|\bar{X} - \bar{Y}|}{\sqrt{\displaystyle\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{j=1}^{m}(Y_j - \bar{Y})^2}}$$

which is the t statistic apart from constants that do not depend on the data. Thus, the 
likelihood ratio test is equivalent to the t test, as claimed. 
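
The equivalence can be made concrete. Since the denominator sum of squares equals $(m+n-2)s_p^2$, the ratio above satisfies $\hat{\sigma}_0^2/\hat{\sigma}_1^2 = 1 + t^2/(m+n-2)$, an increasing function of $|t|$. The short check below verifies this numerically; the two small samples are made up purely for illustration, and the function name is hypothetical.

```python
from statistics import mean


def variance_ratio_and_t(x, y):
    """Return (sigma0_hat^2 / sigma1_hat^2, pooled t statistic)."""
    n, m = len(x), len(y)
    xbar, ybar = mean(x), mean(y)
    mu0 = (sum(x) + sum(y)) / (m + n)  # mle of the common mean under H0
    ss_full = sum((a - xbar) ** 2 for a in x) + sum((b - ybar) ** 2 for b in y)
    ss_null = sum((a - mu0) ** 2 for a in x) + sum((b - mu0) ** 2 for b in y)
    ratio = ss_null / ss_full          # the factors 1/(m+n) cancel
    sp2 = ss_full / (m + n - 2)        # pooled sample variance
    t = (xbar - ybar) / (sp2 * (1 / n + 1 / m)) ** 0.5
    return ratio, t


# Hypothetical data, for illustration only.
x = [4.1, 5.3, 6.0, 5.5, 4.8]
y = [3.9, 4.2, 5.1, 3.5]

ratio, t = variance_ratio_and_t(x, y)
print(f"variance ratio       = {ratio:.6f}")
print(f"1 + t^2 / (m+n-2)    = {1 + t ** 2 / (len(x) + len(y) - 2):.6f}")
```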

We have used the assumption that the two populations have the same variance. 
If the two variances are not assumed to be equal, a natural estimate of $\mathrm{Var}(\bar{X} - \bar{Y})$ is 

$$\frac{s_X^2}{n} + \frac{s_Y^2}{m}$$

If this estimate is used in the denominator of the t statistic, the distribution of that 
statistic is no longer the t distribution. But it has been shown that its distribution can 
be closely approximated by the t distribution with degrees of freedom calculated in 
the following way and then rounded to the nearest integer: 

$$\mathrm{df} = \frac{\left[(s_X^2/n) + (s_Y^2/m)\right]^2}{\dfrac{(s_X^2/n)^2}{n-1} + \dfrac{(s_Y^2/m)^2}{m-1}}$$

Let us rework Example B, but without the assumption that the variances are equal. 
Using the preceding formula, we find the degrees of freedom to be 12 rather than 19. 
The t statistic is 3.12. Since the .995 quantile of the t distribution with 12 df is 3.055 
(Table 4 of Appendix B), the test still rejects at level α = .01. ■ 
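
The degrees-of-freedom formula is easy to mistype by hand, so a small helper may be useful. The sketch below (function name hypothetical) works from summary statistics and is applied to the rounded values quoted in Example A: a difference of .04, $s_A = .024$ with n = 13, and $s_B = .031$ with m = 8.

```python
def unequal_variance_t(diff, s_x, n, s_y, m):
    """t statistic and approximate df without assuming equal variances.

    diff is the difference of sample means; s_x and s_y are the sample
    standard deviations of samples of sizes n and m.
    """
    vx, vy = s_x ** 2 / n, s_y ** 2 / m  # estimated variances of the two means
    t = diff / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df


t, df = unequal_variance_t(0.04, 0.024, 13, 0.031, 8)
print(f"t = {t:.2f}, approximate df = {df:.2f}, rounded df = {round(df)}")
```

The rounded degrees of freedom are then used to look up the appropriate quantile in a t table.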


If the underlying distributions are not normal and the sample sizes are large, the 
use of the t distribution or the normal distribution is justified by the central limit 
theorem, and the probability levels of confidence intervals and hypothesis tests are 
approximately valid. In such a case, however, there is little difference between the t 
and normal distributions. If the sample sizes are small, however, and the distributions 
are not normal, conclusions based on the assumption of normality may not be valid. 
Unfortunately, if the sample sizes are small, the assumption of normality cannot be 
tested effectively unless the deviation is quite gross, as we saw in Chapter 9. 


11.2.1.1 An Example—A Study of Iron Retention An experiment was 
performed to determine whether two forms of iron (Fe²⁺ and Fe³⁺) are retained 
differently. (If one form of iron were retained especially well, it would be the better 
dietary supplement.) The investigators divided 108 mice randomly into 6 groups of 
18 each; 3 groups were given Fe²⁺ in three different concentrations, 10.2, 1.2, and 
.3 millimolar, and 3 groups were given Fe³⁺ at the same three concentrations. The 
mice were given the iron orally; the iron was radioactively labeled so that a counter 
could be used to measure the initial amount given. At a later time, another count was 
taken for each mouse, and the percentage of iron retained was calculated. The data for 
the two forms of iron are listed in the following table. We will look at the data for the 
concentration 1.2 millimolar. (In Chapter 12, we will discuss methods for analyzing 
all the groups simultaneously.) 

           Fe³⁺                        Fe²⁺ 
  10.2     1.2      .3        10.2     1.2      .3 

   .71    2.20    2.25        2.20    4.04    2.71 
  1.66    2.93    3.93        2.69    4.16    5.43 
  2.01    3.08    5.08        3.54    4.42    6.38 
  2.16    3.49    5.82        3.75    4.93    6.38 
  2.42    4.11    5.84        3.83    5.49    8.32 
  2.42    4.95    6.89        4.08    5.77    9.04 
  2.56    5.16    8.50        4.27    5.86    9.56 
  2.60    5.54    8.56        4.53    6.28   10.01 
  3.31    5.68    9.44        5.32    6.97   10.08 
  3.64    6.25   10.52        6.18    7.06   10.62 
  3.74    7.25   13.46        6.22    7.78   13.80 
  3.74    7.90   13.57        6.33    9.23   15.99 
  4.39    8.85   14.76        6.97    9.34   17.90 
  4.50   11.96   16.41        6.97    9.91   18.25 
  5.07   15.54   16.96        7.52   13.46   19.32 
  5.26   15.89   17.56        8.36   18.40   19.87 
  8.15   18.30   22.82       11.65   23.89   21.60 
  8.24   18.59   29.13       12.45   26.39   22.25 





As a summary of the data, boxplots (Figure 11.2) show that the data are quite skewed to 
the right. This is not uncommon with percentages or other variables that are bounded 
below by zero. Three observations from the Fe²⁺ group are flagged as possible outliers. 
The median of the Fe²⁺ group is slightly larger than the median of the Fe³⁺ group, 
but the two distributions overlap substantially. 

Another view of these data is provided by normal probability plots (Figure 11.3). 
These plots also indicate the skewness of the distributions. We should obviously 
doubt the validity of using normal distribution theory (for example, the t test) for this 
problem even though the combined sample size is fairly large (36). 

The mean and standard deviation of the Fe2+ group are 9.63 and 6.69; for the 
Fe3+ group, the mean is 8.20 and the standard deviation is 5.45. To test the hypothesis 
that the two means are equal, we can use a t test without assuming that the population 
standard deviations are equal. The approximate degrees of freedom, calculated as 
described at the end of Section 11.2.1, are 32. The t statistic is .702, which corresponds 
to a p-value of .49 for a two-sided test; if the two populations had the same mean, 
values of the t statistic this large or larger in absolute value would occur 49% of the 
time. There is thus insufficient evidence to reject the null hypothesis. A 95% confidence interval for the 




FIGURE 11.2  Boxplots of the percentages of iron retained for the two forms. 
FIGURE 11.3 Normal probability plots of iron retention data. 


difference of the two population means is (-2.7, 5.6). But the t test assumes that the 
underlying populations are normally distributed, and we have seen there is reason to 
doubt this assumption. 
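
The unequal-variance t calculation above can be retraced from the reported summary statistics alone. The following is a minimal sketch, not the author's code: it assumes that the degrees of freedom "described at the end of Section 11.2.1" are the usual Welch-Satterthwaite approximation, and it takes the quantile 2.037 for roughly 32 degrees of freedom from a t table. The function name `welch_t` is hypothetical. Because the inputs are rounded summaries, the results agree with the text only up to rounding.

```python
import math

def welch_t(mean_x, sd_x, n_x, mean_y, sd_y, n_y):
    """Return (t, df, standard error) for a two-sample t test
    that does not assume equal population standard deviations."""
    vx = sd_x ** 2 / n_x          # estimated variance of the first sample mean
    vy = sd_y ** 2 / n_y          # estimated variance of the second sample mean
    se = math.sqrt(vx + vy)
    t = (mean_x - mean_y) / se
    # Welch-Satterthwaite approximate degrees of freedom (assumed formula)
    df = (vx + vy) ** 2 / (vx ** 2 / (n_x - 1) + vy ** 2 / (n_y - 1))
    return t, df, se

# Summary statistics quoted in the text (Fe2+ first, then Fe3+)
t_stat, df, se = welch_t(9.63, 6.69, 18, 8.20, 5.45, 18)

T_QUANTILE = 2.037                # assumed table value of t(.025), about 32 df
ci_low = (9.63 - 8.20) - T_QUANTILE * se
ci_high = (9.63 - 8.20) + T_QUANTILE * se

print(round(t_stat, 2), round(df, 1), round(ci_low, 1), round(ci_high, 1))
```

The interval endpoints round to the values quoted in the text; the t statistic differs in the third decimal place because the text works from unrounded data.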

It is sometimes advocated that skewed data be transformed to a more symmetric 
shape before normal theory is applied. Transformations such as taking the log or 
the square root can be effective in symmetrizing skewed distributions because they 
spread out small values and compress large ones. Figures 11.4 and 11.5 show boxplots 
and normal probability plots for the natural logs of the iron retention data we have 
been considering. The transformation was fairly successful in symmetrizing these 
distributions, and the probability plots are more linear than those in Figure 11.3, 
although some curvature is still evident. 
FIGURE 11.4 Boxplots of natural logs of percentages of iron retained. 

FIGURE 11.5 Normal probability plots of natural logs of iron retention data. 

The following model is natural for the log transformation: 


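$$
\begin{aligned}
X_i &= \mu_X(1 + \varepsilon_i), \quad i = 1, \ldots, n \\
Y_j &= \mu_Y(1 + \delta_j), \quad j = 1, \ldots, m \\
\log X_i &= \log \mu_X + \log(1 + \varepsilon_i) \\
\log Y_j &= \log \mu_Y + \log(1 + \delta_j)
\end{aligned}
$$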


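Here the $\varepsilon_i$ and $\delta_j$ are independent random variables with mean zero. This model 
implies that if the variances of the errors are $\sigma^2$, then 

$$
E(X_i) = \mu_X, \qquad E(Y_j) = \mu_Y, \qquad \sigma_X = \mu_X \sigma, \qquad \sigma_Y = \mu_Y \sigma
$$

or that 

$$
\frac{\sigma_X}{\mu_X} = \frac{\sigma_Y}{\mu_Y}
$$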


If the $\varepsilon_i$ and $\delta_j$ have the same distribution, Var(log X) = Var(log Y). The ratio of the 
standard deviation of a distribution to the mean is called the coefficient of variation 
(CV); it expresses the standard deviation as a fraction of the mean. Coefficients of 
variation are sometimes expressed as percentages. For the iron retention data we have 
been considering, the CVs are .69 and .67 for the Fe2+ and Fe3+ groups; these values 
are quite close. These data are quite "noisy": the standard deviation is nearly 70% 
of the mean for both groups. 
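
As a check on these summaries, the sketch below recomputes the mean, standard deviation, and coefficient of variation for the 1.2 millimolar columns, together with the mean and standard deviation of the natural logs. It is illustrative only. One assumption should be stated loudly: the sixth Fe2+ value is illegible in this copy of the table and is taken here to be 5.77, a value consistent with the reported mean of 9.63. The helper name `summarize` is hypothetical.

```python
import math
import statistics

# Percentage of iron retained at concentration 1.2 millimolar.
fe3 = [2.20, 2.93, 3.08, 3.49, 4.11, 4.95, 5.16, 5.54, 5.68,
       6.25, 7.25, 7.90, 8.85, 11.96, 15.54, 15.89, 18.3, 18.59]
# ASSUMPTION: the sixth value (5.77) is illegible in this copy of the table.
fe2 = [4.04, 4.16, 4.42, 4.93, 5.49, 5.77, 5.86, 6.28, 6.97,
       7.06, 7.78, 9.23, 9.34, 9.91, 13.46, 18.4, 23.89, 26.39]

def summarize(data):
    """Mean, sample SD, and CV on the raw scale; mean and SD of the natural logs."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    logs = [math.log(v) for v in data]
    return {
        "mean": mean,
        "sd": sd,
        "cv": sd / mean,
        "log_mean": statistics.mean(logs),
        "log_sd": statistics.stdev(logs),
    }

summary_fe2 = summarize(fe2)
summary_fe3 = summarize(fe3)
for name, s in (("Fe2+", summary_fe2), ("Fe3+", summary_fe3)):
    print(name, {key: round(value, 3) for key, value in s.items()})
```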

For the transformed iron retention data, the means and standard deviations are 
given in the following table: 





                        Fe2+     Fe3+
Mean                    2.09     1.90
Standard Deviation      .659     .574


For the transformed data, the t statistic is .917, which gives a p-value of .37. 
Again, there is no reason to reject the null hypothesis. A 95% confidence interval is 
(-.61, .23). Using the preceding model, this is a confidence interval for 

$$
\log \mu_X - \log \mu_Y = \log\left(\frac{\mu_X}{\mu_Y}\right)
$$

The interval is 

$$
-.61 < \log\left(\frac{\mu_X}{\mu_Y}\right) < .23
$$

or 

$$
.54 < \frac{\mu_X}{\mu_Y} < 1.26
$$

Other transformations, such as raising all values to some power, are sometimes 
used. Attitudes toward the use of transformations vary: Some view them as a very 
useful tool in statistics and data analysis, and others regard them as questionable 
manipulation of the data. 
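
The step from the interval for the log ratio to the interval for the ratio itself is just exponentiation of both endpoints. A minimal sketch, using only the endpoints quoted in the text:

```python
import math

# 95% confidence limits for log(mu_X / mu_Y), as quoted in the text
log_low, log_high = -0.61, 0.23

# exp is increasing, so it maps the interval for the log ratio
# onto an interval for the ratio mu_X / mu_Y
ratio_low = math.exp(log_low)
ratio_high = math.exp(log_high)

print(round(ratio_low, 2), round(ratio_high, 2))
```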


Power 


Calculations of power are an important part of planning experiments in order to 
determine how large sample sizes should be. The power of a test is the probability 
of rejecting the null hypothesis when it is false. The power of the two-sample t test 
depends on four factors: 


1. The real difference, $\Delta = |\mu_X - \mu_Y|$. The larger this difference, the greater the 
power. 

2. The significance level $\alpha$ at which the test is done. The larger the significance level, 
the more powerful the test. 

3. The population standard deviation $\sigma$, which is the amplitude of the "noise" that 
hides the "signal." The smaller the standard deviation, the larger the power. 

4. The sample sizes n and m. The larger the sample sizes, the greater the power. 


Before continuing, you should try to understand intuitively why these statements are 
true. We will express them quantitatively below. 

The necessary sample sizes can be determined from the significance level of the 
test, the standard deviation, and the desired power against an alternative hypothesis, 


$$
H_1: \mu_X - \mu_Y = \Delta
$$

To calculate the power of a t test exactly, special tables of the noncentral t 
distribution are required. But if the sample sizes are reasonably large, one can perform 
approximate power calculations based on the normal distribution, as we will now 
demonstrate. 

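Suppose that $\sigma$, $\alpha$, and $\Delta$ are given and that the samples are both of size n. Then 

$$
\mathrm{Var}(\bar{X} - \bar{Y}) = \sigma^2\left(\frac{1}{n} + \frac{1}{n}\right) = \frac{2\sigma^2}{n}
$$

The test at level $\alpha$ of $H_0: \mu_X = \mu_Y$ against the alternative $H_1: \mu_X \neq \mu_Y$ is based on 
the test statistic 

$$
Z = \frac{\bar{X} - \bar{Y}}{\sigma\sqrt{2/n}}
$$

The rejection region for this test is $|Z| > z(\alpha/2)$, or 

$$
|\bar{X} - \bar{Y}| > z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}
$$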
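The power of the test if $\mu_X - \mu_Y = \Delta$ is the probability that the test statistic falls in 
the rejection region, or 

$$
P\left[|\bar{X} - \bar{Y}| > z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}\right]
= P\left[\bar{X} - \bar{Y} > z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}\right]
+ P\left[\bar{X} - \bar{Y} < -z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}\right]
$$

since the two events are mutually exclusive. Both probabilities on the right-hand side 
are calculated by standardizing. For the first one, we have 

$$
P\left[\bar{X} - \bar{Y} > z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}\right]
= P\left[\frac{(\bar{X} - \bar{Y}) - \Delta}{\sigma\sqrt{2/n}} > \frac{z(\alpha/2)\,\sigma\sqrt{2/n} - \Delta}{\sigma\sqrt{2/n}}\right]
= 1 - \Phi\left[z(\alpha/2) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right]
$$

where $\Phi$ is the standard normal cdf. Similarly, the second probability is 

$$
\Phi\left[-z(\alpha/2) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right]
$$

Thus, the probability that the test statistic falls in the rejection region is equal to 

$$
1 - \Phi\left[z(\alpha/2) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right]
+ \Phi\left[-z(\alpha/2) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right]
$$

Typically, as $\Delta$ moves away from zero, one of these terms will be negligible with 
respect to the other. For example, if $\Delta$ is greater than zero, the first term will be 
dominant. For fixed n, this expression can be evaluated as a function of $\Delta$; or for 
fixed $\Delta$, it can be evaluated as a function of n. 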


As an example, let us consider a situation similar to an idealized form of the iron 
retention experiment. Assume that we have samples of size 18 from two normal 
distributions whose standard deviations are both 5, and we calculate the power for 
various values of $\Delta$ when the null hypothesis is tested at a significance level of .05. 
The results of the calculations are displayed in Figure 11.6. We see from the plot that 
if the mean difference in retention is only 1%, the probability of rejecting the null 
hypothesis is quite small, only 9%. A mean difference of 5% in retention rate gives 
a more satisfactory power of 85%. 
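
The normal-approximation power expression can be evaluated directly. The sketch below is one way to do so with the Python standard library; the function name `two_sample_power` is hypothetical, and the calculation assumes a known common standard deviation and equal sample sizes, as in the derivation above.

```python
import math
from statistics import NormalDist

def two_sample_power(delta, sigma, n, alpha=0.05):
    """Approximate power of the two-sided level-alpha test of mu_X = mu_Y
    when the true difference is delta, both samples have size n,
    and the common standard deviation sigma is treated as known."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)        # z(alpha/2)
    shift = (delta / sigma) * math.sqrt(n / 2)   # (Delta/sigma) * sqrt(n/2)
    upper_tail = 1 - std_normal.cdf(z - shift)   # dominant term when delta > 0
    lower_tail = std_normal.cdf(-z - shift)
    return upper_tail + lower_tail

# The idealized iron retention setting: n = 18 per group, sigma = 5
for delta in (1, 2, 3, 4, 5):
    print(delta, round(two_sample_power(delta, 5, 18), 2))
```

At a difference of zero the function returns the significance level itself, which is a quick sanity check on the formula.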

Suppose that we wanted to be able to detect a difference of $\Delta = 1$ with probability 
.9. What sample size would be necessary? Using only the dominant term in the 
expression for the power, the sample size should be such that 

$$
\Phi\left(1.96 - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right) = .1
$$

FIGURE 11.6 Plot of power versus $\Delta$. 

From the tables for the normal distribution, $.1 = \Phi(-1.28)$, so 

$$
1.96 - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}} = -1.28
$$

Solving for n, we find that the necessary sample size would be 525! This is clearly 
unfeasible; if in fact the experimenters wanted to detect such a difference, some 
modification of the experimental technique to reduce $\sigma$ would be necessary. ■
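
Solving the last equation for n gives n = 2[(1.96 + 1.28)σ/Δ]². The sketch below carries out that arithmetic; the function name `sample_size_per_group` is hypothetical, and the two normal quantiles are passed in as the rounded table values used in the text.

```python
import math

def sample_size_per_group(delta, sigma, z_alpha_half, z_power):
    """Smallest n per group satisfying
    z_alpha_half - (delta / sigma) * sqrt(n / 2) <= -z_power,
    i.e. using only the dominant term of the power expression."""
    n_exact = 2 * ((z_alpha_half + z_power) * sigma / delta) ** 2
    return math.ceil(n_exact)

# Table values from the text: z(.025) = 1.96 and Phi(-1.28) = .1
n_needed = sample_size_per_group(delta=1, sigma=5, z_alpha_half=1.96, z_power=1.28)
print(n_needed)
```

Because the required n grows with the square of σ/Δ, halving σ would cut the required sample size by a factor of four, which is why the text suggests reducing σ rather than simply collecting more data.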


A Nonparametric Method—The Mann-Whitney Test 


Nonparametric methods do not assume that the data follow any particular distribu- 
tional form. Many of them are based on replacement of the data by ranks. With this 
replacement, the results are invariant under any monotonic transformation; in com- 
parison, we saw that the p-value of a t test may change if the log of the measurements 
is analyzed rather than the measurements on the original scale. Replacing the data by 
ranks also has the effect of moderating the influence of outliers. 

For purposes of discussion, we will develop the Mann-Whitney test (also some- 
times called the Wilcoxon rank sum test) in a specific context. Suppose that we have 
m + n experimental units to assign to a treatment group and a control group. The 
assignment is made at random: n units are randomly chosen and assigned to 
the control, and the remaining m units are assigned to the treatment. We are in- 
terested in testing the null hypothesis that the treatment has no effect. If the null 
hypothesis is true, then any difference in the outcomes under the two conditions is 
due to the randomization. 
A test statistic is calculated in the following way. First, we group all m + n 
observations together and rank them in order of increasing size (we will assume for 
simplicity that there are no ties, although the argument holds even in the presence 
of ties). We next calculate the sum of the ranks of those observations that came 
from the control group. If this sum is too small or too large, we will reject the null 
hypothesis. 

It is easiest to see how the procedure works by considering a very small example. 
Suppose that a treatment and a control are to be compared: Of four subjects, two are 
randomly assigned to the treatment and the other two to the control, and the following 
responses are observed (the ranks of the observations are shown in parentheses): 

Treatment    Control
  1 (1)       6 (4)
  3 (2)       4 (3)





The sum of the ranks of the control group is R = 7, and the sum of the ranks 
of the treatment group is 3. Does this discrepancy provide convincing evidence of a 
systematic difference between treatment and control, or could it be just due to chance? 
To answer this question, we calculate the probability of such a discrepancy if the 
treatment had no effect at all, so that the difference was entirely due to the particular 
randomization—this is the null hypothesis. The key idea of the Mann-Whitney test is 
that we can explicitly calculate the distribution of R under the null hypothesis, since 
under this hypothesis every assignment of ranks to observations is equally likely and 
we can enumerate all 4! = 24 such assignments. In particular, each of the $\binom{4}{2} = 6$ 
assignments of ranks to the control group shown in the following table is equally 
likely: 

Ranks      R
{1, 2}     3
{1, 3}     4
{1, 4}     5
{2, 3}     5
{2, 4}     6
{3, 4}     7

From this table, we see that under the null hypothesis, the distribution of R (its null 
distribution) is: 

r           3     4     5     6     7
P(R = r)   1/6   1/6   1/3   1/6   1/6

In particular, P(R = 7) = 1/6, so this discrepancy would occur one time out of six 
purely on the basis of chance. 
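
The enumeration in this small example is easy to mechanize, and the same loop works for any sample sizes small enough to list every assignment. The following sketch (the function name `rank_sum_null_distribution` is hypothetical) builds the null distribution of the control rank sum by listing every equally likely set of control ranks.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def rank_sum_null_distribution(n_control, n_total):
    """Null distribution of the control-group rank sum when every subset
    of n_control ranks out of 1..n_total is equally likely (no ties)."""
    counts = Counter(
        sum(ranks) for ranks in combinations(range(1, n_total + 1), n_control)
    )
    n_assignments = sum(counts.values())
    return {r: Fraction(c, n_assignments) for r, c in sorted(counts.items())}

null_dist = rank_sum_null_distribution(n_control=2, n_total=4)
for r, prob in null_dist.items():
    print(r, prob)
```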


EXAMPLE A 




The small example of the previous paragraph has been laid out for pedagogical 
reasons, the point being that we could in principle go through similar calculations 
for any sample sizes m and n. Suppose that there are n observations in the 
treatment group and m in the control group. If the null hypothesis holds, every 
assignment of ranks to the m + n observations is equally likely, and hence each of 
the $\binom{m+n}{m}$ possible assignments of ranks to the control group is equally likely. For 
each of these assignments, we can calculate the sum of the ranks and thus determine 
the null distribution of the test statistic, the sum of the ranks of the control 
group. 

It is important to note that we have not made any assumption that the observations 
from the control and treatment groups are samples from a probability distribution. 
Probability has entered in only as a result of the random assignment of experimental 
units to treatment and control groups (this is similar to the way that probability 
enters into survey sampling). We should also note that, although we chose the sum 
of control ranks as the test statistic, any other test statistic could have been used and 
its null distribution computed in the same fashion. The rank sum is easy to compute 
and is sensitive to a treatment effect that tends to make responses larger or smaller. 
Also, its null distribution has to be computed only once and tabled; if we worked with 
the actual numerical values, the null distribution would depend on those particular 
values. 

Tables of the null distribution of the rank sum are widely available and vary in 
format. Note that because the sum of the two rank sums is the sum of the integers 
from 1 to m + n, which is [(m + n) (m + n + 1)/2], knowing one rank sum tells us 
the other. Some tables are given in terms of the rank sum of the smaller of the two 
groups, and some are in terms of the smaller of the two rank sums (the advantage of 
the latter scheme is that only one tail of the distribution has to be tabled). Table 8 of 
Appendix B makes use of additional symmetries. Let $n_1$ be the smaller sample size 
and let R be the sum of the ranks from that sample. Let $R' = n_1(m + n + 1) - R$ 
and $R^* = \min(R, R')$. The table gives critical values for $R^*$. (Fortunately, such fussy 
tables are largely obsolete with the increasing use of computers.) 

When it is more appropriate to model the control values, $X_1, \ldots, X_n$, as a sample 
from some probability distribution F and the experimental values, $Y_1, \ldots, Y_m$, as a 
sample from some distribution G, the Mann-Whitney test is a test of the null hypothesis 
$H_0: F = G$. The reasoning is exactly the same: Under $H_0$, any assignment of ranks 
to the pooled m + n observations is equally likely, etc. 

We have assumed here that there are no ties among the observations. If there 
are only a small number of ties, tied observations are assigned average ranks (the 
average of the ranks for which they are tied); the significance levels are not greatly 
affected. 


Let us illustrate the Mann-Whitney test by referring to the data on latent heats of fusion 
of ice considered earlier (Example A in Section 11.2.1). The sample sizes are fairly 
small (13 and 8), so in the absence of any prior knowledge concerning the adequacy 
of the assumption of a normal distribution, it would seem safer to use a nonparametric 
method. The following table exhibits the ranks given to the measurements for each 
method (refer to Example A in Section 11.2.1 for the original data): 

Method A    Method B
   7.5        11.5
  19.0         1.0
  11.5         7.5
  19.0         4.5
  15.5         4.5
  15.5        15.5
  19.0         2.0
   4.5         4.5
  21.0
  15.5
  11.5
   9.0
  11.5





Note how the ties were handled. For example, the four observations with the value 
79.97 tied for ranks 3, 4, 5, and 6 were each assigned the rank of 4.5 = (3 + 4 + 
5 + 6)/4. 
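
The tie-handling rule is simple to state in code. The sketch below assigns average ranks to tied values; the function name `average_ranks` is hypothetical, and the sample values are invented solely to reproduce the tie pattern just described (four equal values occupying ranks 3 through 6). They are not the latent heat data.

```python
def average_ranks(values):
    """Rank the values from smallest to largest, giving tied values
    the average of the ranks they jointly occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the block while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2   # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Hypothetical values: two smaller observations, then four tied at 79.97
example = [79.94, 79.95, 79.97, 79.97, 79.97, 79.97, 80.02]
example_ranks = average_ranks(example)
print(example_ranks)
```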

Table 8 of Appendix B is used as follows. The sum of the ranks of the smaller 
sample is R = 51. 

$$
R' = 8(8 + 13 + 1) - R = 125
$$

Thus, $R^* = 51$. From the table, 53 is the critical value for a two-tailed test with 
$\alpha = .01$, and 60 is the critical value for $\alpha = .05$. The Mann-Whitney test thus rejects 
at the .01 significance level. ■


Let $T_Y$ denote the sum of the ranks of $Y_1, Y_2, \ldots, Y_m$. Using results from Chapter 7, 
we can easily find $E(T_Y)$ and $\mathrm{Var}(T_Y)$ under the null hypothesis $F = G$. 

THEOREM A 
If $F = G$, 

$$
E(T_Y) = \frac{m(m + n + 1)}{2}
$$

$$
\mathrm{Var}(T_Y) = \frac{mn(m + n + 1)}{12}
$$

Proof 
Under the null hypothesis, $T_Y$ is the sum of a random sample of size m drawn 
without replacement from a population consisting of the integers $\{1, 2, \ldots, m + n\}$. 
$T_Y$ thus equals m times the average of such a sample. From Theorems A 
and B of Section 7.3.1, 

$$
E(T_Y) = m\mu
$$

$$
\mathrm{Var}(T_Y) = m\sigma^2\left(\frac{N - m}{N - 1}\right)
$$

where $N = m + n$ is the size of the population, and $\mu$ and $\sigma^2$ are the population 
mean and variance. Now, using the identities 

$$
\sum_{k=1}^{N} k = \frac{N(N + 1)}{2}, \qquad \sum_{k=1}^{N} k^2 = \frac{N(N + 1)(2N + 1)}{6}
$$

we find that for the population $\{1, 2, \ldots, m + n\}$ 

$$
\mu = \frac{N + 1}{2}, \qquad \sigma^2 = \frac{N^2 - 1}{12}
$$

The result then follows after algebraic simplification. ■
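
For small m and n, Theorem A can be checked by brute force: list every equally likely set of ranks for the Y's and compute the mean and variance of the rank sum exactly. The sketch below does this with exact rational arithmetic; the function name `rank_sum_moments` is hypothetical, and the check is a numerical illustration, not a substitute for the proof.

```python
from fractions import Fraction
from itertools import combinations

def rank_sum_moments(m, n):
    """Exact null mean and variance of the sum of m ranks drawn
    without replacement from 1..(m + n), by complete enumeration."""
    sums = [sum(ranks) for ranks in combinations(range(1, m + n + 1), m)]
    mean = Fraction(sum(sums), len(sums))
    second_moment = Fraction(sum(s * s for s in sums), len(sums))
    return mean, second_moment - mean ** 2

m, n = 3, 4
mean, variance = rank_sum_moments(m, n)
print(mean, Fraction(m * (m + n + 1), 2))          # enumerated vs. Theorem A
print(variance, Fraction(m * n * (m + n + 1), 12))
```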


Unlike the t test, the Mann-Whitney test does not depend on an assumption of 
normality. Since the actual numerical values are replaced by their ranks, the test is 
insensitive to outliers, whereas the t test is sensitive. It has been shown that even 
when the assumption of normality holds, the Mann-Whitney test is nearly as powerful 
as the t test and it is thus generally preferable, especially for small sample 
sizes. 

The Mann-Whitney test can also be derived starting from a different point of 
view. Suppose that the X's are a sample from F and the Y's a sample from G, and 
consider estimating, as a measure of the effect of the treatment, 


$$
\pi = P(X < Y)
$$

where X and Y are independently distributed with distribution functions F and G, 
respectively. The value $\pi$ is the probability that an observation from the distribution 
F is smaller than an independent observation from the distribution G. 

If, for example, F and G represent lifetimes of components that have been manufactured 
according to two different conditions, $\pi$ is the probability that a component 
of one type will last longer than a component of the other type. An estimate of $\pi$ can 
be obtained by comparing all n values of X to all m values of Y and calculating the 
proportion of the comparisons for which X was less than Y: 

$$
\hat{\pi} = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} Z_{ij}
$$

where 

$$
Z_{ij} = \begin{cases} 1, & \text{if } X_i < Y_j \\ 0, & \text{otherwise} \end{cases}
$$

To see the relationship of $\hat{\pi}$ to the rank sum introduced earlier, we will find it convenient 
to work with 

$$
V_{ij} = \begin{cases} 1, & \text{if } X_{(i)} < Y_{(j)} \\ 0, & \text{otherwise} \end{cases}
$$

Clearly, 

$$
\sum_{i=1}^{n} \sum_{j=1}^{m} Z_{ij} = \sum_{i=1}^{n} \sum_{j=1}^{m} V_{ij}
$$

since the $V_{ij}$ are just a reordering of the $Z_{ij}$. Also, 

$$
\sum_{i=1}^{n} \sum_{j=1}^{m} V_{ij} = (\text{number of X's that are less than } Y_{(1)}) + (\text{number of X's that are less than } Y_{(2)}) + \cdots + (\text{number of X's that are less than } Y_{(m)})
$$

If the rank of $Y_{(k)}$ in the combined sample is denoted by $R_{yk}$, then the number of X's 
less than $Y_{(1)}$ is $R_{y1} - 1$, the number of X's less than $Y_{(2)}$ is $R_{y2} - 2$, etc. Therefore, 

$$
\sum_{i=1}^{n} \sum_{j=1}^{m} V_{ij} = (R_{y1} - 1) + (R_{y2} - 2) + \cdots + (R_{ym} - m)
= \sum_{i=1}^{m} R_{yi} - \sum_{i=1}^{m} i
= T_Y - \frac{m(m + 1)}{2}
$$

Thus, $\hat{\pi}$ may be expressed in terms of the rank sum of the Y's (or in terms of the rank 
sum of the X's, since the two rank sums add up to a constant). 
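
The identity between the pairwise count and the rank sum can be checked numerically. The sketch below computes the number of pairs with X less than Y in two ways: by direct comparison of all pairs, and from the rank sum of the Y's minus m(m + 1)/2. The function names are hypothetical, ties are assumed absent, and the data are the four responses of the small treatment and control example (treating the treatment responses as the X's is a labeling choice made here for illustration).

```python
def pairwise_count(xs, ys):
    """Number of (x, y) pairs with x < y, by direct comparison."""
    return sum(1 for x in xs for y in ys if x < y)

def count_from_rank_sum(xs, ys):
    """The same count obtained as T_Y - m(m + 1)/2, where T_Y is the
    sum of the ranks of the y's in the pooled sample (no ties assumed)."""
    pooled = sorted(xs + ys)
    t_y = sum(pooled.index(y) + 1 for y in ys)
    m = len(ys)
    return t_y - m * (m + 1) // 2

xs = [1, 3]   # treatment responses from the small example
ys = [6, 4]   # control responses from the small example

u_direct = pairwise_count(xs, ys)
u_ranks = count_from_rank_sum(xs, ys)
pi_hat = u_direct / (len(xs) * len(ys))
print(u_direct, u_ranks, pi_hat)
```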


EXAMPLE B 
Let $U_Y = \sum_{i=1}^{n} \sum_{j=1}^{m} V_{ij} = T_Y - m(m + 1)/2$, the number of pairs for which 
$X_i < Y_j$. From Theorem A, we have 

COROLLARY A 
Under the null hypothesis $H_0: F = G$, 

$$
E(U_Y) = \frac{mn}{2}
$$

$$
\mathrm{Var}(U_Y) = \frac{mn(m + n + 1)}{12}
$$

For m and n both greater than 10, the null distribution of $U_Y$ is quite well 
approximated by a normal distribution, 

$$
\frac{U_Y - E(U_Y)}{\sqrt{\mathrm{Var}(U_Y)}} \sim N(0, 1)
$$

(Note that this does not follow immediately from the ordinary central limit theorem; 
although $U_Y$ is a sum of random variables, they are not independent.) Similarly, the 
distribution of the rank sum of the X's or Y's may be approximated by a normal 
distribution, since these rank sums differ from $U_Y$ only by constants. 


Referring to Example A, let us use a normal approximation to the distribution of the 
rank sum from method B. For n = 13 and m = 8, we have from Corollary A that 
under the null hypothesis, 

$$
E(T) = \frac{8(8 + 13 + 1)}{2} = 88
$$

$$
\sigma_T = \sqrt{\frac{8 \times 13(8 + 13 + 1)}{12}} = 13.8
$$

T is the sum of the ranks from method B, or 51, and the normalized test statistic is 

$$
\frac{T - E(T)}{\sigma_T} = -2.68
$$

From the tables of the normal distribution, this corresponds to a p-value of .007 for 
a two-sided test, so the null hypothesis is rejected at level $\alpha = .01$, just as it was 
when we used the exact distribution. For this set of data, we have seen that the t 
test with the assumption of equal variances, the t test without that assumption, the 
exact Mann-Whitney test, and the approximate Mann-Whitney test all reject at level 
$\alpha = .01$. ■
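
The normal approximation for a rank sum takes only a few lines. The sketch below reproduces the numbers for the method B rank sum; the function name `rank_sum_normal_test` is hypothetical, and no continuity correction is applied, which matches the calculation shown in the text.

```python
import math
from statistics import NormalDist

def rank_sum_normal_test(rank_sum, m, n):
    """Normal approximation to the null distribution of the rank sum of a
    sample of size m, pooled with a second sample of size n.
    Returns (null mean, null SD, z, two-sided p-value)."""
    mean = m * (m + n + 1) / 2
    sd = math.sqrt(m * n * (m + n + 1) / 12)
    z = (rank_sum - mean) / sd
    p_value = 2 * NormalDist().cdf(-abs(z))
    return mean, sd, z, p_value

# Method B: rank sum 51 with m = 8, pooled with n = 13 from method A
mean_t, sd_t, z, p_value = rank_sum_normal_test(51, m=8, n=13)
print(mean_t, round(sd_t, 1), round(z, 2), round(p_value, 3))
```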
The Mann-Whitney test can be inverted to form confidence intervals. Let us 
consider a "shift" model: $G(x) = F(x - \Delta)$. This model says that the effect of the 
treatment (the Y's) is to add a constant $\Delta$ to what the response would have been with 
no treatment (the X's). (This is a very simple model, and we have already seen cases 
for which it is not appropriate.) We now derive a confidence interval for $\Delta$. To test 
$H_0: F = G$, we used the statistic $U_Y$ equal to the number of the $X_i - Y_j$ that are 
less than zero. To test the hypothesis that the shift parameter is $\Delta$, we can similarly 
use 

$$
U_Y(\Delta) = \#[X_i - (Y_j - \Delta) < 0]
$$

It can be shown that the null distribution of $U_Y(\Delta)$ is symmetric about mn/2: 

$$
P\left(U_Y(\Delta) = \frac{mn}{2} + k\right) = P\left(U_Y(\Delta) = \frac{mn}{2} - k\right)
$$

for all integers k. Suppose that $k = k(\alpha)$ is such that $P(k \le U_Y(\Delta) \le mn - k) = 1 - \alpha$; 
the level $\alpha$ test then accepts for such $U_Y(\Delta)$. By the duality of confidence 
intervals and hypothesis tests, a $100(1 - \alpha)\%$ confidence interval for $\Delta$ is thus 

$$
C = \{\Delta \mid k \le U_Y(\Delta) \le mn - k\}
$$

C consists of the set of values $\Delta$ for which the null hypothesis would not be rejected. 
We can find an explicit form for this confidence interval. Let $D_{(1)}, D_{(2)}, \ldots, D_{(mn)}$ 
denote the ordered mn differences $Y_j - X_i$. We will show that 

$$
C = [D_{(k)}, D_{(mn-k+1)})
$$

To see this, first suppose that $\Delta = D_{(k)}$. Then 

$$
U_Y(\Delta) = \#(X_i - Y_j + \Delta < 0) = \#(Y_j - X_i > \Delta) = mn - k
$$

Similarly, if $\Delta = D_{(mn-k+1)}$, 

$$
U_Y(\Delta) = \#(X_i - Y_j + \Delta < 0) = \#(Y_j - X_i > \Delta) = k - 1
$$

so that the right endpoint is excluded from C. 
(You might find it helpful to consider the case m = 3, n = 2, k = 2.) 
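
The suggested case m = 3, n = 2, k = 2 can be worked through in code. The sketch below forms the mn ordered differences, reads off the two interval endpoints, and evaluates the count of differences exceeding a trial shift at each endpoint. The data values and function names are hypothetical, chosen only so that the six differences are distinct.

```python
def shift_interval(xs, ys, k):
    """Endpoints D_(k) and D_(mn-k+1) of the interval for the shift,
    from the ordered differences y - x (no tied differences assumed)."""
    diffs = sorted(y - x for x in xs for y in ys)
    return diffs[k - 1], diffs[len(diffs) - k]

def u_y(xs, ys, delta):
    """Number of differences y - x that exceed the trial shift delta."""
    return sum(1 for x in xs for y in ys if y - x > delta)

xs = [1, 2]        # hypothetical X values, n = 2
ys = [4, 6, 9]     # hypothetical Y values, m = 3
k = 2

left, right = shift_interval(xs, ys, k)
print(left, right)                              # D_(2) and D_(5)
print(u_y(xs, ys, left), u_y(xs, ys, right))    # counts at the two endpoints
```

At the left endpoint the count equals mn - k, so that value is retained; at the right endpoint the count has fallen just below k, which is why the interval is open on the right.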


We return to the data on iron retention (Section 11.2.1.1). The earlier analysis using 
the t test rested on the assumption that the populations were normally distributed, 
which, in fact, seemed rather dubious. The Mann-Whitney test does not make this 
assumption. The sum of the ranks of the Fe2+ group is used as a test statistic (we 
could have as easily used the U statistic). The rank sum is 362. Using the normal 
approximation to the null distribution of the rank sum, we get a p-value of .36. Again, 
there is insufficient evidence to reject the null hypothesis that there is no differential 
retention. The 95% confidence interval for the shift between the two distributions is 
(-1.6, 3.7), which overlaps zero substantially. Note that this interval is shorter than 
the interval based on the t distribution; the latter was inflated by the contributions of 
the large observations to the sample variance. ■
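
The rank sum and its approximate p-value can be recomputed from the 1.2 millimolar columns of the data table. The sketch below is illustrative only. As a loudly stated assumption, the sixth Fe2+ value is illegible in this copy of the table and is taken to be 5.77; any value between the neighboring pooled observations 5.68 and 5.86 would receive the same rank, so the rank sum does not depend on the exact figure.

```python
import math
from statistics import NormalDist

fe3 = [2.20, 2.93, 3.08, 3.49, 4.11, 4.95, 5.16, 5.54, 5.68,
       6.25, 7.25, 7.90, 8.85, 11.96, 15.54, 15.89, 18.3, 18.59]
# ASSUMPTION: the sixth value (5.77) is illegible in this copy of the table.
fe2 = [4.04, 4.16, 4.42, 4.93, 5.49, 5.77, 5.86, 6.28, 6.97,
       7.06, 7.78, 9.23, 9.34, 9.91, 13.46, 18.4, 23.89, 26.39]

# These 36 values contain no ties, so each value has a unique rank.
pooled = sorted(fe3 + fe2)
rank_of = {value: position + 1 for position, value in enumerate(pooled)}
rank_sum_fe2 = sum(rank_of[value] for value in fe2)

m, n = len(fe2), len(fe3)
null_mean = m * (m + n + 1) / 2
null_sd = math.sqrt(m * n * (m + n + 1) / 12)
z = (rank_sum_fe2 - null_mean) / null_sd
p_value = 2 * NormalDist().cdf(-abs(z))

print(rank_sum_fe2, round(z, 2), round(p_value, 2))
```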


We close this section with an illustration of the use of the bootstrap in a two-sample 
problem. As before, suppose that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are 
two independent samples from distributions F and G, respectively, and that 
$\pi = P(X < Y)$ is estimated by $\hat{\pi}$. How can the standard error of $\hat{\pi}$ be estimated and how 
can an approximate confidence interval for $\pi$ be constructed? (Note that the calculations 
of Theorem A are not directly relevant, since they are done under the assumption 
that F = G.) 

The problem can be approached in the following way: First suppose for the 
moment that F and G were known. Then the sampling distribution of $\hat{\pi}$ and its 
standard error could be estimated by simulation. A sample of size n would be generated 
from F, an independent sample of size m would be generated from G, and the resulting 
value of $\hat{\pi}$ would be computed. This procedure would be repeated many times, say B 
times, producing $\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_B$. A histogram of these values would be an indication 
of the sampling distribution of $\hat{\pi}$ and their standard deviation would be an estimate 
of the standard error of $\hat{\pi}$. 

Of course, this procedure cannot be implemented, because F and G are not 
known. But as in the previous chapter, an approximation can be obtained by using the 
empirical distributions $F_n$ and $G_m$ in their places. This means that a bootstrap value of 
$\hat{\pi}$ is generated by randomly selecting n values from $X_1, X_2, \ldots, X_n$ with replacement, 
m values from $Y_1, Y_2, \ldots, Y_m$ with replacement and calculating the resulting value 
of $\hat{\pi}$. In this way, a bootstrap sample $\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_B$ is generated. 
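
The bootstrap procedure just described can be sketched in a few lines. Everything specific in the example below is an assumption made for illustration: the two small data sets are invented, the function names are hypothetical, B = 1000 and the seed are arbitrary, and the percentile interval at the end is only one simple way of turning the bootstrap sample into an approximate confidence interval.

```python
import random
import statistics

def pi_hat(xs, ys):
    """Proportion of (x, y) pairs with x < y."""
    return sum(1 for x in xs for y in ys if x < y) / (len(xs) * len(ys))

def bootstrap_pi_hat(xs, ys, n_boot=1000, seed=0):
    """Bootstrap replicates of pi_hat: resample each sample with
    replacement from its own empirical distribution."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        boot_x = rng.choices(xs, k=len(xs))
        boot_y = rng.choices(ys, k=len(ys))
        replicates.append(pi_hat(boot_x, boot_y))
    return replicates

# Hypothetical data, for illustration only
xs = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]
ys = [3.9, 6.1, 4.8, 2.7, 5.5]

replicates = bootstrap_pi_hat(xs, ys)
standard_error = statistics.stdev(replicates)   # bootstrap estimate of the SE
ordered = sorted(replicates)
percentile_interval = (ordered[24], ordered[974])   # rough 95% percentile interval

print(round(pi_hat(xs, ys), 3), round(standard_error, 3), percentile_interval)
```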


Bayesian Approach 


We consider a Bayesian approach to the model, which stipulates that the $X_i$ are i.i.d. 
normal with mean $\mu_X$ and precision $\xi$; and the $Y_j$ are i.i.d. normal with mean $\mu_Y$, 
precision $\xi$, and independent of the $X_i$. In general, a prior joint distribution assigned 
to $(\mu_X, \mu_Y, \xi)$ would be multiplied by the likelihood and normalized to integrate 
to 1 to produce a three-dimensional joint posterior distribution for $(\mu_X, \mu_Y, \xi)$. The 
marginal joint distribution of $(\mu_X, \mu_Y)$ could be obtained by integrating out $\xi$. The 
marginal distribution of $\mu_X - \mu_Y$ could then be obtained by another integration as in 
Section 3.6.1. Several integrations would thus have to be done, either analytically or 
numerically. Special Monte Carlo methods have been devised for high dimensional 
Bayesian problems, but we will not consider them here. 

An approximate result can be obtained using improper priors. We take $(\mu_X, \mu_Y, \xi)$ 
to be independent. The means $\mu_X$ and $\mu_Y$ are given improper priors that are constant 
on $(-\infty, \infty)$, and $\xi$ is given the improper prior $f_\Xi(\xi) = \xi^{-1}$. The posterior is thus 
proportional to the likelihood multiplied by $\xi^{-1}$: 

$$
f_{\text{post}}(\mu_X, \mu_Y, \xi) \propto \xi^{\frac{n+m}{2} - 1} \exp\left(-\frac{\xi}{2}\left[\sum_{i=1}^{n}(x_i - \mu_X)^2 + \sum_{j=1}^{m}(y_j - \mu_Y)^2\right]\right)
$$

Next, using $\sum_{i=1}^{n}(x_i - \mu_X)^2 = (n - 1)s_x^2 + n(\mu_X - \bar{x})^2$ and the analogous expression 
for the $y_j$, we have 

$$
f_{\text{post}}(\mu_X, \mu_Y, \xi) \propto \xi^{\frac{n+m}{2} - 1} \exp\left(-\frac{\xi}{2}\left[(n - 1)s_x^2 + (m - 1)s_y^2\right]\right)
\times \exp\left(-\frac{n\xi}{2}(\mu_X - \bar{x})^2\right) \exp\left(-\frac{m\xi}{2}(\mu_Y - \bar{y})^2\right)
$$

From the form of this expression as a function of $\mu_X$ and $\mu_Y$, we see that for fixed $\xi$, 
$\mu_X$ and $\mu_Y$ are independent normally distributed with means $\bar{x}$ and $\bar{y}$ and precisions 
$n\xi$ and $m\xi$. Their difference, $\mu_X - \mu_Y$, is thus normally distributed with mean $\bar{x} - \bar{y}$ 
and variance $\xi^{-1}(n^{-1} + m^{-1})$. 

With further analysis similar to that of Section 8.6, it can be shown that the 
marginal posterior distribution of $\Delta = \mu_X - \mu_Y$ can be related to the t distribution: 

$$
\frac{\Delta - (\bar{x} - \bar{y})}{s_p\sqrt{n^{-1} + m^{-1}}} \sim t_{n+m-2}
$$

Although formally similar to Theorem A of Section 11.2.1, the interpretation is different: 
$\bar{x} - \bar{y}$ and $s_p$ are random in Theorem A but are fixed here, and $\Delta = \mu_X - \mu_Y$ 
is random here but fixed in Theorem A. The Bayesian formalism makes probability 
statements about $\Delta$ given the observed data. 

The posterior probability that $\Delta > 0$ can thus be found using the t distribution. Let 
T denote a random variable with a $t_{m+n-2}$ distribution. Then, denoting the observations 
by X and Y, 

$$
P(\Delta > 0 \mid X, Y) = P\left(\frac{\Delta - (\bar{x} - \bar{y})}{s_p\sqrt{n^{-1} + m^{-1}}} > -\frac{\bar{x} - \bar{y}}{s_p\sqrt{n^{-1} + m^{-1}}} \;\Big|\; X, Y\right)
= P\left(T \ge \frac{\bar{y} - \bar{x}}{s_p\sqrt{n^{-1} + m^{-1}}}\right)
$$
Spy n 1 +m! 
Letting X denote the measurements of method A, and Y denote the measurements 
of method B in Example A of Section 11.2.1, we find that for that example, 

$$
P(\Delta > 0 \mid X, Y) = P(T \ge -3.33) = .998
$$

This posterior probability is very close to 1.0, and there is thus little doubt that the 
mean of method A is larger than the mean of method B. 

The confidence interval calculated in Section 11.2.1 is formally similar but has 
a different interpretation under the Bayesian model, which concludes that 

$$
P(.015 \le \Delta \le .065 \mid X, Y) = .95
$$

by integration of the posterior t distribution over a region containing 95% of the 
probability. 
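
The posterior probability can also be approximated by simulating from the posterior directly, which avoids t tables. The sketch below is a hedged illustration, not the method of the text. It relies on two facts that follow from the posterior density given earlier under the improper priors: integrating out the two means leaves the precision with a gamma distribution with shape (n + m - 2)/2 and rate (n + m - 2)s_p²/2, and given the precision the difference of means is normal. The inputs are hypothetical summaries scaled so that the standardized difference is 3.33 with n = 13 and m = 8; they are not the original measurements.

```python
import math
import random

def posterior_prob_positive(diff, s_pooled, n, m, draws=100_000, seed=1):
    """Monte Carlo estimate of P(Delta > 0 | data) under the improper-prior
    model: draw the precision from its gamma marginal posterior, then
    draw Delta from its normal conditional posterior."""
    rng = random.Random(seed)
    dof = n + m - 2
    sum_sq = dof * s_pooled ** 2          # (n-1)s_x^2 + (m-1)s_y^2
    hits = 0
    for _ in range(draws):
        # gammavariate takes (shape, scale); scale = 1 / rate = 2 / sum_sq
        precision = rng.gammavariate(dof / 2, 2 / sum_sq)
        sd_delta = math.sqrt((1 / n + 1 / m) / precision)
        if rng.gauss(diff, sd_delta) > 0:
            hits += 1
    return hits / draws

# Hypothetical summaries giving a standardized difference of 3.33
n, m, s_pooled = 13, 8, 1.0
diff = 3.33 * s_pooled * math.sqrt(1 / n + 1 / m)

prob = posterior_prob_positive(diff, s_pooled, n, m)
print(round(prob, 3))
```

The estimate should land close to the value .998 obtained from the t distribution, up to Monte Carlo error of roughly one or two units in the fourth decimal place.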


Comparing Paired Samples 


In Section 11.2, we considered the problem of analyzing two independent samples. 
In many experiments, the samples are paired. In a medical experiment, for example, 
subjects might be matched by age or weight or severity of condition, and then one 
member of each pair randomly assigned to the treatment group and the other to the 
control group. In a biological experiment, the paired subjects might be littermates. 
In some applications, the pair consists of a “before” and an "after" measurement on 
the same object. Since pairing causes the samples to be dependent, the analysis of 
Section 11.2 does not apply. 

Pairing can be an effective experimental technique, as we will now demonstrate 
by comparing a paired design and an unpaired design. First, we consider the paired 
design. Let us denote the pairs as $(X_i, Y_i)$, where $i = 1, \ldots, n$, and assume the X's 
and Y's have means $\mu_X$ and $\mu_Y$ and variances $\sigma_X^2$ and $\sigma_Y^2$. We will assume that different 
pairs are independently distributed and that $\mathrm{Cov}(X_i, Y_i) = \sigma_{XY}$. We will work with 
the differences $D_i = X_i - Y_i$, which are independent with 

$$
E(D_i) = \mu_X - \mu_Y
$$

$$
\mathrm{Var}(D_i) = \sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY} = \sigma_X^2 + \sigma_Y^2 - 2\rho\sigma_X\sigma_Y
$$

where $\rho$ is the correlation of members of a pair. A natural estimate of $\mu_X - \mu_Y$ is 
$\bar{D} = \bar{X} - \bar{Y}$, the average difference. From the properties of $D_i$, it follows that 

$$
E(\bar{D}) = \mu_X - \mu_Y
$$

$$
\mathrm{Var}(\bar{D}) = \frac{1}{n}\left(\sigma_X^2 + \sigma_Y^2 - 2\rho\sigma_X\sigma_Y\right)
$$

Suppose, on the other hand, that an experiment had been done by taking a sample 
of n X's and an independent sample of n Y's. Then $\mu_X - \mu_Y$ would be estimated by 
$\bar{X} - \bar{Y}$ and 

$$
E(\bar{X} - \bar{Y}) = \mu_X - \mu_Y
$$

$$
\mathrm{Var}(\bar{X} - \bar{Y}) = \frac{1}{n}\left(\sigma_X^2 + \sigma_Y^2\right)
$$

Comparing the variances of the two estimates, we see that the variance of $\bar{D}$ is 
smaller if the correlation is positive, that is, if the X's and Y's are positively correlated. 
In this circumstance, pairing is the more effective experimental design. In 
the simple case in which $\sigma_X = \sigma_Y = \sigma$, the two variances may be more simply 
expressed as 

$$
\mathrm{Var}(\bar{D}) = \frac{2\sigma^2(1 - \rho)}{n}
$$

in the paired case and as 

$$
\mathrm{Var}(\bar{X} - \bar{Y}) = \frac{2\sigma^2}{n}
$$

in the unpaired case, and the relative efficiency is 

$$
\frac{\mathrm{Var}(\bar{D})}{\mathrm{Var}(\bar{X} - \bar{Y})} = 1 - \rho
$$
If the correlation coefficient is .5, for example, a paired design with n pairs of subjects 
yields the same precision as an unpaired design with 2n subjects per treatment. This 
additional precision results in shorter confidence intervals and more powerful tests if 
the degrees of freedom for estimating $\sigma^2$ are sufficiently large. 
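
The comparison of the two designs in the equal-variance case reduces to two one-line formulas. The sketch below evaluates them; the function names and the numerical inputs are hypothetical and serve only to illustrate the relative efficiency of 1 - ρ.

```python
def var_paired(sigma, rho, n_pairs):
    """Variance of the mean difference from n_pairs pairs,
    common SD sigma and within-pair correlation rho."""
    return 2 * sigma ** 2 * (1 - rho) / n_pairs

def var_unpaired(sigma, n_per_group):
    """Variance of the difference of two independent sample means,
    each based on n_per_group observations with common SD sigma."""
    return 2 * sigma ** 2 / n_per_group

sigma, rho, n = 5, 0.5, 10                  # hypothetical values
paired = var_paired(sigma, rho, n)
unpaired_same_n = var_unpaired(sigma, n)
unpaired_double_n = var_unpaired(sigma, 2 * n)

print(paired / unpaired_same_n)             # relative efficiency, 1 - rho
print(paired, unpaired_double_n)            # equal when rho = .5
```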

We next present methods based on the normal distribution for analyzing data 
from paired designs and then a nonparametric, rank-based method. 


Methods Based on the Normal Distribution 


In this section, we assume that the differences are a sample from a normal distribution 
with 


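$$
E(D_i) = \mu_X - \mu_Y = \mu_D
$$

$$
\mathrm{Var}(D_i) = \sigma_D^2
$$

Generally, $\sigma_D$ will be unknown, and inferences will be based on 

$$
t = \frac{\bar{D} - \mu_D}{s_{\bar{D}}}
$$

where $s_{\bar{D}} = s_D/\sqrt{n}$ is the estimated standard deviation of $\bar{D}$. This statistic 
follows a t distribution with n - 1 degrees of freedom. Following familiar 
reasoning, a $100(1 - \alpha)\%$ confidence interval for $\mu_D$ is 

$$
\bar{D} \pm t_{n-1}(\alpha/2)\, s_{\bar{D}}
$$

A two-sided test of the null hypothesis $H_0: \mu_D = 0$ (the natural null hypothesis for 
testing no treatment effect) at level $\alpha$ has the rejection region 

$$
|\bar{D}| > t_{n-1}(\alpha/2)\, s_{\bar{D}}
$$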


If the sample size n is large, the approximate validity of the confidence interval 
and hypothesis test follows from the central limit theorem. If the sample size is small 
and the true distribution of the differences is far from normal, the stated probability 
levels may be considerably in error. 


To study the effect of cigarette smoking on platelet aggregation, Levine (1973) drew 
blood samples from 11 individuals before and after they smoked a cigarette and 
measured the extent to which the blood platelets aggregated. Platelets are involved 
in the formation of blood clots, and it is known that smokers suffer more often 
from disorders involving blood clots than do nonsmokers. The data are shown in 
the following table, which gives the maximum percentage of all the platelets that 
aggregated after being exposed to a stimulus. 
Before    After    Difference
  25       27          2
  25       29          4
  27       37         10
  44       56         12
  30       46         16
  67       82         15
  53       57          4
  53       80         27
  52       61          9
  60       59         -1
  28       43         15





From the column of differences, $\bar{D} = 10.27$ and $s_{\bar{D}} = 2.40$. The uncertainty 
in $\bar{D}$ is quantified in $s_{\bar{D}}$ or in a confidence interval. Since $t_{10}(.05) = 1.812$, a 90% 
confidence interval is $\bar{D} \pm 1.812\, s_{\bar{D}}$, or (5.9, 14.6). We can also formally test the null 
hypothesis that means before and after are the same. The t statistic is 10.27/2.40 = 
4.28, and since $t_{10}(.005) = 3.169$, the p-value of a two-sided test is less than .01. 
There is little doubt that smoking increases platelet aggregation. 
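
The paired calculations for the platelet data can be reproduced from the table. The sketch below uses the before and after columns as printed and takes the quantile 1.812 from the text rather than computing it. Working from unrounded intermediate values gives a t statistic of about 4.27, slightly below the 4.28 obtained in the text from the rounded quotient 10.27/2.40.

```python
import math
import statistics

before = [25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28]
after = [27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43]

diffs = [a - b for b, a in zip(before, after)]
n = len(diffs)

d_bar = statistics.mean(diffs)
se_d_bar = statistics.stdev(diffs) / math.sqrt(n)   # estimated SD of the mean difference
t_stat = d_bar / se_d_bar

T10_05 = 1.812                                      # t_10(.05), as quoted in the text
ci_low = d_bar - T10_05 * se_d_bar
ci_high = d_bar + T10_05 * se_d_bar

print(round(d_bar, 2), round(se_d_bar, 2), round(t_stat, 2))
print(round(ci_low, 1), round(ci_high, 1))
```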

The experiment was actually more complex than we have indicated. Some sub- 
jects also smoked cigarettes made of lettuce leaves and “smoked” unlit cigarettes. 
(You should reflect on why these additional experiments were done.) 

Figure 11.7 is a plot of the after values versus the before values. They are 
correlated, with a correlation coefficient of .90. Pairing was a natural and effective 
experimental design in this case. 





FIGURE 11.7 Plot of platelet aggregation after smoking versus aggregation before 
smoking (horizontal axis: Before; vertical axis: After). 
A Nonparametric Method: The Signed Rank Test 


A nonparametric test based on ranks can be constructed for paired samples. We 
illustrate the calculation with a very small example. Suppose there are four pairs, 
corresponding to “before” and “after” measurements listed in the following table: 








Before   After   Difference   |Difference|   Rank   Signed Rank
  25       27         2             2          2          2
  29       25        -4             4          3         -3
  60       59        -1             1          1         -1
  27       37        10            10          4          4























The test statistic is calculated by the following steps: 


1. Calculate the differences, $D_i$, and the absolute values of the differences and rank 
   the latter. 
2. Restore the signs of the differences to the ranks, obtaining signed ranks. 
3. Calculate $W_+$, the sum of those ranks that have positive signs. For the table, this 
   sum is $W_+ = 2 + 4 = 6$. 
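
The three steps can be traced in code for the four pairs of the small example. This is a sketch that assumes, as is true here, that there are no ties and no zero differences; the variable names are illustrative.

```python
# The four "before" and "after" pairs of the small example.
before = [25, 29, 60, 27]
after = [27, 25, 59, 37]

# Step 1: differences, then ranks of their absolute values (1 = smallest).
diffs = [a - b for a, b in zip(after, before)]
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0] * len(diffs)
for rank, i in enumerate(order, start=1):
    ranks[i] = rank

# Step 2: restore the signs of the differences to the ranks.
signed = [r if d > 0 else -r for r, d in zip(ranks, diffs)]

# Step 3: sum the ranks that carry a positive sign.
w_plus = sum(r for r in signed if r > 0)

print(diffs, ranks, signed, w_plus)
```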




The idea behind the signed rank test (sometimes called the Wilcoxon signed rank 
test) is intuitively simple. If there is no difference between the two paired conditions, 
we expect about half the $D_i$ to be positive and half negative, and $W_+$ will not be too 
small or too large. If one condition tends to produce larger values than the other, $W_+$ 
will tend to be more extreme. We therefore can use $W_+$ as a test statistic and reject 
for extreme values. 

Before continuing, we need to specify more precisely the null hypothesis we are 
testing with the signed rank test: $H_0$ states that the distribution of the $D_i$ is symmetric 
about zero. This will be true if the members of pairs of experimental units are assigned 
randomly to treatment and control conditions, and the treatment has no effect at all. 

As usual, in order to define a rejection region for a test at level $\alpha$, we need to 
know the sampling distribution of $W_+$ if the null hypothesis is true. The rejection 
region will be located in the tails of this null distribution in such a way that the 
test has level $\alpha$. The null distribution may be calculated in the following way. If $H_0$ 
is true, it makes no difference which member of the pair corresponds to treatment 
and which to control. The difference $X_i - Y_i = D_i$ has the same distribution as the 
difference $Y_i - X_i = -D_i$, so the distribution of $D_i$ is symmetric about zero. The 
difference with the kth largest absolute value is thus equally likely to be positive or 
negative, and any particular assignment of signs to the integers $1, \ldots, n$ (the ranks) 
is equally likely. There are $2^n$ such assignments, and for each we can calculate $W_+$. 
We obtain a list of $2^n$ values (not all distinct) of $W_+$, each of which occurs with 
probability $1/2^n$. The probability of each distinct value of $W_+$ may thus be calculated, 
giving the desired null distribution. 
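
The enumeration just described is easy to carry out directly for small n. The sketch below, with the illustrative function name `signed_rank_null`, lists every assignment of signs to the ranks 1, ..., n and tabulates the resulting values of the positive-rank sum. It assumes no ties and no zero differences.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def signed_rank_null(n):
    """Exact null distribution of W+ for n untied, nonzero differences."""
    counts = Counter()
    # Each tuple of 0s and 1s is one assignment of signs; 1 marks a positive difference.
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        counts[w] += 1
    total = 2 ** n  # every assignment has probability 1 / 2**n under H0
    return {w: Fraction(c, total) for w, c in sorted(counts.items())}


dist = signed_rank_null(4)
for w, p in dist.items():
    print(w, p)

# Upper-tail probability at the value W+ = 6 observed in the four-pair example.
p_ge_6 = sum(p for w, p in dist.items() if w >= 6)
print("P(W+ >= 6) =", p_ge_6)
```

The cost grows as $2^n$, so direct enumeration is only practical for small samples; this is one reason tables and the normal approximation are used.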

The preceding argument has assumed that the $D_i$ are a sample from some 
continuous probability distribution. If we do not wish to regard the $X_i$ and $Y_i$ as random 
variables and if the assignments to treatment and control have been made at random, 
the hypothesis that there is no treatment effect may be tested in exactly the same 
manner, except that inferences are based on the distribution induced by the 
randomization, as was done for the Mann-Whitney test. 

The null distribution of $W_+$ is calculated by many computer packages, and tables 
are also available. 

The signed rank test is a nonparametric version of the paired sample t test. 
Unlike the t test, it does not depend on an assumption of normality. Since differences 
are replaced by ranks, it is insensitive to outliers, whereas the t test is sensitive. It 
has been shown that even when the assumption of normality holds, the signed rank 
test is nearly as powerful as the t test. The nonparametric method is thus generally 
preferable, especially for small sample sizes. 


The signed rank test can be applied to the data on platelet aggregation considered 
previously (Example A in Section 11.3.1). In this case, it is easier to work with $W_-$ 
rather than $W_+$, since $W_-$ is clearly 1. From Table 9 of Appendix B, the two-sided 
test is significant at $\alpha = .01$. 
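
In place of a table, the tail probability for the platelet data can be computed exactly. The sketch below counts, for n = 11, the sign assignments that give each possible rank sum and then doubles the lower tail at the observed value of 1. It treats the ranks as the untied integers 1, ..., 11, which is an assumption: the data contain a few tied differences, although none involves the smallest absolute difference. The function name is illustrative.

```python
def count_subsets_by_sum(n):
    """counts[w] = number of subsets of {1, ..., n} whose elements sum to w."""
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1  # the empty subset
    for k in range(1, n + 1):
        # Iterate downward so that rank k is used at most once per subset.
        for w in range(max_sum, k - 1, -1):
            counts[w] += counts[w - k]
    return counts


n = 11          # number of subjects in the platelet study
w_minus = 1     # observed sum of the ranks with negative sign

counts = count_subsets_by_sum(n)
lower_tail = sum(counts[: w_minus + 1]) / 2 ** n   # P(W- <= 1) under H0
p_two_sided = 2 * lower_tail

print(f"two-sided p-value = {p_two_sided:.5f}")
```

The resulting p-value is well below .01, in agreement with the conclusion drawn from the table.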


If the sample size is greater than 20, a normal approximation to the null 
distribution can be used. To find this, we calculate the mean and variance of $W_+$. 


THEOREM A 


Under the null hypothesis that the $D_i$ are independent and symmetrically 
distributed about zero, 

$$E(W_+) = \frac{n(n+1)}{4}$$

$$\mathrm{Var}(W_+) = \frac{n(n+1)(2n+1)}{24}$$


Proof 


To facilitate the calculation, we represent $W_+$ in the following way: 

$$W_+ = \sum_{k=1}^{n} k I_k$$

where 

$$I_k = \begin{cases} 1, & \text{if the kth largest } |D_i| \text{ has } D_i > 0 \\ 0, & \text{otherwise} \end{cases}$$

Under $H_0$, the $I_k$ are independent Bernoulli random variables with $p = \frac{1}{2}$, so 

$$E(I_k) = \frac{1}{2}$$

$$\mathrm{Var}(I_k) = \frac{1}{4}$$

We thus have 

$$E(W_+) = \frac{1}{2} \sum_{k=1}^{n} k = \frac{n(n+1)}{4}$$

$$\mathrm{Var}(W_+) = \frac{1}{4} \sum_{k=1}^{n} k^2 = \frac{1}{4} \cdot \frac{n(n+1)(2n+1)}{6} = \frac{n(n+1)(2n+1)}{24}$$

as was to be shown. 
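
As a numerical check on Theorem A, the mean and variance can be computed by brute force for small n and compared with the two formulas. This is only a sketch; `exact_moments` is an illustrative name, and exact rational arithmetic is used so that the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import product


def exact_moments(n):
    """Mean and variance of W+ obtained by enumerating all 2**n sign assignments."""
    values = [
        sum(k for k, s in zip(range(1, n + 1), signs) if s)
        for signs in product((0, 1), repeat=n)
    ]
    mean = Fraction(sum(values), len(values))
    var = Fraction(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, var


for n in range(1, 9):
    mean, var = exact_moments(n)
    # Compare with the closed forms stated in Theorem A.
    assert mean == Fraction(n * (n + 1), 4)
    assert var == Fraction(n * (n + 1) * (2 * n + 1), 24)

print("Theorem A formulas agree with enumeration for n = 1, ..., 8")
```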


If some of the differences are equal to zero, the most common technique is to 
discard those observations. If there are ties, each $|D_i|$ is assigned the average value of 
the ranks for which it is tied. If there are not too many ties, the significance level of 
the test is not greatly affected. If there are a large number of ties, modifications must 
be made. For further information on these matters, see Hollander and Wolfe (1973) 
or Lehmann (1975). 


An Example—Measuring Mercury Levels in Fish 


Kacprzak and Chvojka (1976) compared two methods of measuring mercury levels 
in fish. A new method, which they called “selective reduction,” was compared to 
an established method, referred to as “the permanganate method.” One advantage 
of selective reduction is that it allows simultaneous measurement of both inorganic 
mercury and methyl mercury. The mercury in each of 25 juvenile black marlin was 
measured by both techniques. The 25 measurements for each method (in ppm of 
mercury) and the differences are given in the following table. 





Fish   Selective Reduction   Permanganate   Difference   Signed Rank
  1           .32                .39            .07         +15.5
  2           .40                .47            .07         +15.5
  3           .11                .11            .00
  4           .47                .43           -.04         -11
  5           .32                .42            .10         +19
  6           .35                .30           -.05         -13.5
  7           .32                .43            .11         +20
  8           .63                .98            .35         +23
  9           .50                .86            .36         +24
 10           .60                .79            .19         +22
 11           .38                .33           -.05         -13.5
 12           .46                .45           -.01         -2.5
 13           .20                .22            .02         +6.5
 14           .31                .30           -.01         -2.5
 15           .62                .60           -.02         -6.5
 16           .52                .53            .01         +2.5
 17           .77                .85            .08         +17.5
 18           .23                .21           -.02         -6.5
 19           .30                .33            .03         +9.0
 20           .70                .57           -.13         -21
 21           .41                .43            .02         +6.5
 22           .53                .49           -.04         -11
 23           .19                .20            .01         +2.5
 24           .31                .35            .04         +11
 25           .48                .40           -.08         -17.5





In analyzing such data, it is often informative to check whether the differences 
depend in some way on the level or size of the quantity being measured. The differ- 
ences versus the permanganate values are plotted in Figure 11.8. This plot is quite 
interesting. It appears that the differences are small for low permanganate values and 
larger for higher permanganate values. It is striking that the differences are all posi- 
tive and large for the highest four values. The investigators do not comment on these 
phenomena. It is not uncommon for the size of fluctuations to increase as the value 
being measured increases; the percent error may remain nearly constant but the actual 
error does not. For this reason, data of this nature are often analyzed on a log scale. 


FIGURE 11.8 Plot of differences versus permanganate values (horizontal axis: 
permanganate value; vertical axis: differences). 
Because the observations are paired (two measurements on each fish), we will 
use the paired t test for a parametric test. The sample size is large enough that the test 
should be robust against nonnormality. The mean difference is .04, and the standard 
deviation of the differences is .116. The t statistic is 1.724; with 24 degrees of freedom, 
this corresponds to a p-value of .094 for a two-sided test. Although this p-value is 
fairly small, the evidence against $H_0\colon \mu_D = 0$ is not overwhelming. The test does not 
reject at the significance level .05. 

The signed ranks are shown in the last column of the table above. Note that the 
single zero difference was set aside, and also note how the tied ranks were handled. 
The test statistic $W_+$ is 194.5. Under $H_0$, its mean and variance are 

$$E(W_+) = \frac{24 \times 25}{4} = 150$$

$$\mathrm{Var}(W_+) = \frac{24 \times 25 \times 49}{24} = 1225$$

Since n is greater than 20, we use the normalized test statistic, or 

$$Z = \frac{W_+ - E(W_+)}{\sqrt{\mathrm{Var}(W_+)}} = 1.27$$

The p-value for a two-sided test from the normal approximation is .20, which is not 
strong evidence against the null hypothesis. It is possible to correct for the presence 
of ties, but in this case the correction only amounts to changing the standard deviation 
of $W_+$ from 35 to 34.95. 

Neither the parametric nor the nonparametric test gives conclusive evidence that 
there is any systematic difference between the two methods of measurement. The 
informal graphical analysis does suggest, however, that there may be a difference for 
high concentrations of mercury. 
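
The calculations for the mercury data can be reproduced with the following sketch, which uses only the Python standard library. The measurements are entered in hundredths of ppm so that tied differences are detected exactly. The tie correction shown, which subtracts the sum of $(t^3 - t)/48$ over groups of t tied values from the variance, is the commonly used formula and is an assumption here, since the text does not state which correction was applied. Variable names are illustrative.

```python
import math
import statistics

# Mercury measurements in hundredths of ppm (integers avoid floating-point
# noise when detecting tied differences).
selective = [32, 40, 11, 47, 32, 35, 32, 63, 50, 60, 38, 46, 20,
             31, 62, 52, 77, 23, 30, 70, 41, 53, 19, 31, 48]
permanganate = [39, 47, 11, 43, 42, 30, 43, 98, 86, 79, 33, 45, 22,
                30, 60, 53, 85, 21, 33, 57, 43, 49, 20, 35, 40]
diffs = [p - s for p, s in zip(permanganate, selective)]

# Paired t statistic on all 25 differences (unaffected by the change of units).
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Signed rank statistic: drop zeros, give tied |differences| their average rank.
nonzero = sorted((d for d in diffs if d != 0), key=abs)
n = len(nonzero)
w_plus = 0.0
tie_term = 0  # sum of (size**3 - size) over groups of tied values
i = 0
while i < n:
    j = i
    while j < n and abs(nonzero[j]) == abs(nonzero[i]):
        j += 1
    avg_rank = (i + 1 + j) / 2  # average of the ranks i+1, ..., j
    w_plus += avg_rank * sum(1 for d in nonzero[i:j] if d > 0)
    size = j - i
    tie_term += size ** 3 - size
    i = j

mean_w = n * (n + 1) / 4
var_w = n * (n + 1) * (2 * n + 1) / 24
z = (w_plus - mean_w) / math.sqrt(var_w)

# Two-sided p-value from the standard normal distribution.
p_value = 1 - math.erf(abs(z) / math.sqrt(2))

# Tie-corrected standard deviation (assumed standard correction, see lead-in).
sd_tie_corrected = math.sqrt(var_w - tie_term / 48)

print(f"t = {t_stat:.3f}")
print(f"W+ = {w_plus}, E(W+) = {mean_w}, Var(W+) = {var_w}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
print(f"tie-corrected sd = {sd_tie_corrected:.2f}")
```

The t statistic printed by the script is computed from unrounded values, so it may differ slightly from the figure in the text, which is based on the rounded mean and standard deviation.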


Experimental Design 


This section covers some basic principles of the interpretation and design of experi- 
mental studies and illustrates them with case studies. 


Mammary Artery Ligation 


A person with coronary artery disease suffers from chest pain during exercise because 
the constricted arteries cannot deliver enough oxygen to the heart. The treatment 
of ligating the mammary arteries enjoyed a brief vogue; the basic idea was that 
ligating these arteries forced more blood to flow into the heart. This procedure had the 
advantage of being quite simple surgically, and it was widely publicized in an article 
in Reader's Digest (Ratcliffe 1957). Two years later, the results of a more careful study 
(Cobb et al. 1959) were published. In this study, a control group and an experimental 
group were established in the following way. When a prospective patient entered 
surgery, the surgeon made the necessary preliminary incisions prior to tying off the 
mammary artery. At that point, the surgeon opened a sealed envelope that contained 
instructions about whether to complete the operation by tying off the artery. Neither 
the patient nor his attending physician knew whether the operation had actually been 
carried out. The study showed essentially no difference after the operation between 
the control group (no ligation) and the experimental group (ligation), although there 
was some suggestion that the control group had done better. 

The Ratcliffe and Cobb studies differ in that in the earlier one there was no 
control group and thus no benchmark by which to gauge improvement. The reported 
improvement of the patients in this earlier study could have been due to the placebo 
effect, which we discuss next. The design of the later study protected against possible 
unconscious biases by randomly assigning the control and experimental groups and 
by concealing from the patients and their physicians the actual nature of the treatment. 
Such a design is called a double-blind, randomized controlled experiment. 


The Placebo Effect 


The placebo effect refers to the effect produced by any treatment, including dummy 
pills (placebos), when the subject believes that he or she has been given an effective 
treatment. The possibility of a placebo effect makes the use of a blind design necessary 
in many experimental investigations. 

The placebo effect may not be due entirely to psychological factors, as was 
shown in an interesting experiment by Levine, Gordon, and Fields (1978). A group 
of subjects had teeth extracted. During the extraction, they were given nitrous oxide 
and local anesthesia. In the recovery room, they rated the amount of pain they were 
experiencing on a numerical scale. Two hours after surgery, the subjects were given 
a placebo and were again asked to rate their pain. An hour later, some of the subjects 
were given a placebo and some were given naloxone, a morphine antagonist. It is 
known that there are specific receptors to morphine in the brain and that the body 
can also release endorphins that bind to these sites. Naloxone blocks the morphine 
receptors. In the study, it was found that when those subjects who responded positively 
to the placebo received naloxone, they experienced an increase in pain that made their 
pain levels comparable to those of the patients who did not respond to the placebo. 
The implication is that those who responded to the placebo had produced endorphins, 
the actions of which were subsequently blocked by the naloxone. 

An instance of the placebo effect was demonstrated by a psychologist, Claude 
Steele (2002), who gave a math exam to a group of male and female undergraduates 
at Stanford University. One group (treatment) was told that the exam was gender- 
neutral, and the other group (controls) was not so informed. The men outperformed 
the women in the control group. In the treatment group, men and women performed 
equally well. Men in the treatment group did worse than men in the control group 
(Economist, Feb 21, 2002). 


The Lanarkshire Milk Experiment 


The importance of the randomized assignment of individuals (or other experimental 
units) to treatment and control groups is illustrated by a famous study known as the 
Lanarkshire milk experiment. In the spring of 1930, an experiment was carried out in 
Lanarkshire, Scotland, to determine the effect of providing free milk to schoolchildren. 
In each participating school, some children (treatment group) were given free milk 
and others (controls) were not. The assignment of children to control or treatment 
was initially done at random; however, teachers were allowed to use their judgment 
in switching children between treatment and control to obtain a better balance of 
undernourished and well-nourished individuals in the groups. 

A paper by Gosset (1931), who published under the name Student (as in Stu- 
dent's t test), is a very interesting critique of the experiment. An examination of the 
data revealed that at the start of the experiment the controls were heavier and taller. 
Student conjectured that the teachers, perhaps unconsciously, had adjusted the initial 
randomization in a manner that placed more of the undernourished children in the 
treatment group. A further complication was caused by weighing the children with 
their clothes on. The experimental data were weight gains measured in late spring 
relative to early spring or late winter. The more well-to-do children probably tended to 
be better nourished and may have had heavier winter clothing than the poor children. 
Thus, the well-to-do children's weight gains were vitiated as a result of differences in 
clothing, which may have influenced comparisons between the treatment and control 
groups. 


The Portacaval Shunt 


Cirrhosis of the liver, to which alcoholics are prone, is a condition in which resistance 
to blood flow causes blood pressure in the liver to build up to dangerously high levels. 
Vessels may rupture, which may cause death. Surgeons have attempted to relieve this 
condition by connecting the portal vein, which feeds the liver, to the vena cava, 
one of the main veins returning to the heart, thus reducing blood flow through the 
liver. This procedure, called the portacaval shunt, had been used for more than 20 
years when Grace, Muench, and Chalmers (1966) published an examination of 51 
studies of the method. They examined the design of each study (presence or absence 
of a control group and presence or absence of randomization) and the investigators’ 
conclusions (categorized as markedly enthusiastic, moderately enthusiastic, or not 
enthusiastic). The results are summarized in the following table, which speaks for 
itself: 











                                   Enthusiasm
Design                     Marked   Moderate   None
No controls                  24        7         1
Nonrandomized controls       10        3         2
Randomized controls           0        1         3





The difference between the experiments that used controls and those that did 
not is not entirely surprising, because the placebo effect was probably operating. The 
importance of randomized assignment to treatment and control groups is illustrated 
by comparing the conclusions for the randomized and nonrandomized controlled 
experiments. Randomization can help to ensure against subtle unconscious biases that 
may creep into an experiment. For example, a physician might tend to recommend 
surgery for patients who are somewhat more robust than the average. Articulate 
patients might be more likely to have an influence on the decision as to which group 
they are assigned to. 


FD&C Red No. 40 


This discussion follows Lagakos and Mosteller (1981). During the middle and late 
1970s, experiments were conducted to determine possible carcinogenic effects of a 
widely used food coloring, FD&C Red No. 40. One of the experiments involved 
500 male and 500 female mice. Both genders were divided into five groups: two 
control groups, a low-dose group, a medium-dose group, and a high-dose group. The 
mice were bred in the following way: Males and females were paired and before 
and during mating were given their prescribed dose of Red No. 40. The regime was 
continued during gestation and weaning of the young. From litters that had at least 
three pups of each sex, three of each sex were selected randomly and continued 
on their parents’ dosage throughout their lives. After 109-111 weeks, all the mice 
still living were killed. The presence or absence of reticuloendothelial tumors was 
of particular interest. Although there were significant differences between some of 
the treatment groups, the results were rather confusing. For example, there was a 
significant difference between the incidence rates for the two male control groups, 
and among the males the medium-dose group had the lowest incidence. 

Several experts were asked to examine the results of this and other experiments. 
Among them were Lagakos and Mosteller, who requested information on how the 
cages that housed the mice were arranged. There were three racks of cages, each 
containing five rows of seven cages in the front and five rows of seven cages in the 
back. Five mice were housed in each cage. The mice were assigned to the cages in a 
systematic way: The first male control group was in the top of the front of rack 1; the 
first female control group was in the bottom of the front of rack 1; and so on, ending 
with the high-dose females in the bottom of the back of rack 3 (Figure 11.9). Lagakos 
and Mosteller showed that there were effects due to cage position that could not be 
explained by gender or by dosage group. A random assignment of cage positions 
would have eliminated this confounding. Lagakos and Mosteller also suggested some 
experimental designs to systematically control for cage position. 


FIGURE 11.9 Location of mice cages in racks (rack 1 shown, front and back). 
It was also possible that a litter effect might be complicating the analysis, since 
littermates received the same treatment and littermates of the same sex were housed 
in the same or contiguous cages. In the presence of a litter effect, mice from the same 
litter might show less variability than that present among mice from different litters. 
This reduces the effective sample size—in the extreme case in which littermates react 
identically, the effective sample size is the number of litters, not the total number of 
mice. One way around this problem would have been to use only one mouse from 
each litter. 

The presence of a possible selection bias is another problem. Because mice were 
included in the experiment only if they came from a litter with at least three males 
and three females, offspring of possibly less healthy parents were excluded. This 
could be a serious problem since exposure to Red No. 40 might affect the parents’ 
health and the birth process. If, for example, among the high-dose mice, only the 
most hardy produced large enough litters, their offspring might be hardier than the 
controls’ offspring. 


Further Remarks on Randomization 


As well as guarding against possible biases on the part of the experimenter, the pro- 
cess of randomization tends to balance any factors that may be influential but are 
not explicitly controlled in the experiment. Time is often such a factor; background 
variables such as temperature, equipment calibration, line voltage, and chemical com- 
position can change slowly with time. In experiments that are run over some period of 
time, therefore, it is important to randomize the assignments to treatment and control 
over time. Time is not the only factor that should be randomized, however. In agricul- 
tural experiments, the positions of test plots in a field are often randomly assigned. 
In biological experiments with test animals, the locations of the animals’ cages may 
have an effect, as illustrated in the preceding section. 
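
As an illustration of randomizing positions, the sketch below assigns cages to rack positions at random for a layout like the one in the Red No. 40 study. It is hypothetical throughout: the group labels and the seed are invented for the example, and the figure of 20 cages per group assumes that each sex was split equally among the five groups, which the description above does not state explicitly.

```python
import random

rng = random.Random(2024)  # fixed seed only so that the sketch is reproducible

# Hypothetical labels for the ten sex-by-dose groups.
doses = ["control 1", "control 2", "low", "medium", "high"]
groups = [(sex, dose) for sex in ("male", "female") for dose in doses]

# 3 racks x 2 sides x 5 rows x 7 columns = 210 cage positions.
positions = [
    (rack, side, row, col)
    for rack in (1, 2, 3)
    for side in ("front", "back")
    for row in range(1, 6)
    for col in range(1, 8)
]

# Assumed: 100 mice per group at 5 mice per cage, giving 20 cages per group.
cages = [group for group in groups for _ in range(20)]

# Draw the positions to be used, in random order, and pair them with the cages.
chosen = rng.sample(positions, len(cages))
assignment = dict(zip(chosen, cages))

for position in sorted(assignment)[:5]:
    print(position, assignment[position])
```

With positions drawn this way, any effect of rack, side, or row is no longer systematically tied to sex or dose, although it still adds variability to the comparisons.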

Although rarer than in other areas, randomized experiments have been carried 
out in the social sciences as well (Economist Feb 28, 2002). Randomized trials have 
been used to evaluate such programs as driver training, as well as the criminal justice 
system and reduced classroom size. In evaluations of “whole-language” approaches to 
reading (in which children are taught to read by evaluating contextual clues rather than 
breaking down words), 52 randomized studies carried out by the National Reading 
Panel in 2000 showed that effective reading instruction requires phonics. Randomized 
studies of “scared straight” programs, in which juvenile delinquents are introduced 
to prison inmates, suggested that the likelihood of subsequent arrests is actually 
increased by such programs. 

Generally, if it is anticipated that a variable will have a significant effect, that 
variable should be included as one of the controlled factors in the experimental design. 
The matched-pairs design of this chapter can be used to control for a single factor. 
To control for more than one factor, factorial designs, which are briefly introduced in 
the next chapter, may be used. 
Observational Studies, Confounding, 
and Bias in Graduate Admissions 


It is not always possible to conduct controlled experiments or use randomization. 
In evaluating some medical therapies, for example, a randomized, controlled experi- 
ment would be unethical if one therapy was strongly believed to be superior. For many 
problems of psychological interest (effects of parental modes of discipline, for exam- 
ple), it is impossible to conduct controlled experiments. In such situations, recourse 
is often made to observational studies. Hospital records may be examined to compare 
the outcomes of different therapies, or psychological records of children raised in 
different ways may be analyzed. Although such studies may be valuable, the results 
are seldom unequivocal. Because there is no randomization, it is always possible that 
the groups under comparison differ in respects other than their “treatments.” 

As an example, let us consider a study of gender bias in admissions to graduate 
school at the University of California at Berkeley (Bickel and O'Connell 1975). In 
the fall of 1973, 8442 men applied for admission to graduate studies at Berkeley, and 
44% were admitted; 4321 women applied, and 35% were admitted. If the men and 
women were similar in every respect other than sex, this would be strong evidence 
of sex bias. This was not a controlled, randomized experiment, however; sex was not 
randomly assigned to the applicants. As will be seen, the male and female applicants 
differed in other respects, which influenced admission. 

The following table shows admission rates for the six most popular majors on 
the Berkeley campus. 











                  Men                          Women
         Number of    Percentage      Number of    Percentage
Major    Applicants   Admitted        Applicants   Admitted
A           825          62              108          82
B           560          63               25          68
C           325          37              593          34
D           417          33              375          35
E           191          28              393          24
F           373           6              341           7








If the percentages admitted are compared, women do not seem to be unfavorably 
treated. But when the combined admission rates for all six majors are calculated, it 
is found that 44% of the men and only 30% of the women were admitted, which 
seems paradoxical. The resolution of the paradox lies in the observation that the 
women tended to apply to majors that had low admission rates (C through F) and 
the men to majors that had relatively high admission rates (A and B). This factor 
was not controlled for, because the study was observational in nature; it was also 
"confounded" with the factor of interest, sex; randomization, had it been possible, 
would have tended to balance out the confounded factor. 
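
The reversal can be reproduced from the table of the six majors. In the sketch below, the number admitted in each major is reconstructed from the rounded percentage, so the combined rates are approximate. The women's rate for major E is entered as 24%, the value consistent with the combined 30% quoted above. Function names are illustrative.

```python
# (number of applicants, percent admitted) for each major.
men = {"A": (825, 62), "B": (560, 63), "C": (325, 37),
       "D": (417, 33), "E": (191, 28), "F": (373, 6)}
# Women's rate for major E taken as 24, consistent with the combined 30%.
women = {"A": (108, 82), "B": (25, 68), "C": (593, 34),
         "D": (375, 35), "E": (393, 24), "F": (341, 7)}


def combined_rate(table):
    """Overall percentage admitted, weighting each major by its applicants."""
    applicants = sum(n for n, _ in table.values())
    admitted = sum(n * pct / 100 for n, pct in table.values())
    return 100 * admitted / applicants


def share_applying_to(table, majors):
    """Percentage of a group's applicants who applied to the given majors."""
    total = sum(n for n, _ in table.values())
    return 100 * sum(table[m][0] for m in majors) / total


print(f"combined rate, men:   {combined_rate(men):.1f}%")
print(f"combined rate, women: {combined_rate(women):.1f}%")
print(f"share applying to A or B, men:   {share_applying_to(men, 'AB'):.1f}%")
print(f"share applying to A or B, women: {share_applying_to(women, 'AB'):.1f}%")
```

The last two lines show the confounding directly: about half of the men, but well under a tenth of the women, applied to the two majors with high admission rates.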

Confounding also plays an important role in studies of the effect of coffee 
drinking. Several studies have claimed to show a significant association of coffee 
consumption with coronary disease. Clearly, randomized, controlled trials are not 
possible here—a randomly selected individual cannot be told that he or she is in the 
treatment group and must drink 10 cups of coffee a day for the next five years. Also, it 
is known that heavy coffee drinkers also tend to smoke more than average, so smoking 
is confounded with coffee drinking. Hennekens et al. (1976) review several studies 
in this area. 


Fishing Expeditions 


Another problem that sometimes flaws observational studies, and controlled exper- 
iments as well, is that they engage in “fishing expeditions.” For example, consider 
a hypothetical study of the effects of birth control pills. In such a case, it would be 
impossible to assign women to a treatment or a placebo at random, but a nonrandom- 
ized study might be conducted by carefully matching controls to treatments on such 
factors as age and medical history. The two groups might be followed up on for some 
time, with many variables being recorded for each subject such as blood pressure, 
psychological measures, and incidences of various medical problems. After termina- 
tion of the study, the two groups might be compared on each of these variables, and 
it might be found, say, that there was a "significant difference" in the incidence of 
melanoma. The problem with this "significant finding" is the following. Suppose that 
100 independent two-sample t tests are conducted at the .05 level and that, in fact, all 
the null hypotheses are true. We would expect that five of the tests would produce a 
"significant" result. Although each of the tests has probability .05 of type I error, as a 
collection they do not simultaneously have $\alpha = .05$. The combined significance level 
is the probability that at least one of the null hypotheses is rejected: 

$$\begin{aligned}
\alpha &= P\{\text{at least one } H_0 \text{ rejected}\} \\
       &= 1 - P\{\text{no } H_0 \text{ rejected}\} \\
       &= 1 - .95^{100} = .994
\end{aligned}$$


Thus, with very high probability, at least one “significant” result will be found, even 
if all the null hypotheses are true. 
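
A small simulation makes the same point. The sketch below relies on the fact that, when a null hypothesis is true and the test statistic is continuous, the p-value is uniformly distributed on (0, 1); each test is therefore represented by a single uniform draw rather than by simulated data. The function name, seed, and number of simulated studies are arbitrary choices.

```python
import random


def familywise_error_rate(num_tests, alpha, num_studies, rng):
    """Fraction of simulated studies with at least one p-value below alpha."""
    hits = 0
    for _ in range(num_studies):
        # One uniform draw per test stands in for the p-value of a true null.
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / num_studies


rng = random.Random(0)
simulated = familywise_error_rate(100, 0.05, 5000, rng)
exact = 1 - 0.95 ** 100

print(f"simulated: {simulated:.3f}, exact: {exact:.3f}")
```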

There are no simple cures for this problem. One possibility is to regard the 
results of a fishing expedition as merely providing suggestions for further experiments. 
Alternatively, and in the same spirit, the data could be split randomly into two halves, 
one half for fishing in and the other half to be locked safely away, unexamined. 
“Significant” results from the first half could then be tested on the second half. A 
third alternative is to conduct each individual hypothesis test at a small significance 
level. To see how this works, suppose that all null hypotheses are true and that each 
of n null hypotheses is tested at level $\alpha$. Let $R_i$ denote the event that the ith null 
hypothesis is rejected, and let $\alpha^*$ denote the overall probability of a type I error. Then 

$$\begin{aligned}
\alpha^* &= P\{R_1 \text{ or } R_2 \text{ or } \cdots \text{ or } R_n\} \\
         &\le P\{R_1\} + P\{R_2\} + \cdots + P\{R_n\} \\
         &= n\alpha
\end{aligned}$$

Thus, if each of the n null hypotheses is tested at level $\alpha/n$, the overall significance 
level is less than or equal to $\alpha$. This is often called the Bonferroni method. 
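
The effect of the Bonferroni adjustment can be seen numerically. The sketch below computes the per-test level and, for the special case of independent tests (an assumption made only so that the overall level has a closed form), the resulting overall type I error probability. The function names are illustrative.

```python
def bonferroni_level(alpha, num_tests):
    """Per-test significance level under the Bonferroni method."""
    return alpha / num_tests


def exact_overall_level(per_test_level, num_tests):
    """Overall type I error probability when the tests are independent."""
    return 1 - (1 - per_test_level) ** num_tests


alpha = 0.05
for n in (1, 5, 20, 100):
    level = bonferroni_level(alpha, n)
    overall = exact_overall_level(level, n)
    # The overall level never exceeds alpha (allowing for rounding error).
    assert overall <= alpha + 1e-12
    print(f"n = {n:3d}: per-test level = {level:.5f}, overall level = {overall:.4f}")
```

For independent tests the overall level falls only slightly below the nominal value, so the bound is not wasteful in that case; it can be quite conservative when the tests are strongly dependent.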
Concluding Remarks 


This chapter was concerned with the problem of comparing two samples. Within this 
context, the fundamental statistical concepts of estimation and hypothesis testing, 
which were introduced in earlier chapters, were extended and utilized. The chapter 
also showed how informal descriptive and data analytic techniques are used in sup- 
plementing more formal analysis of data. Chapter 12 will extend the techniques of 
this chapter to deal with multisample problems. Chapter 13 is concerned with similar 
problems that arise in the analysis of qualitative data. 

We considered two types of experiments, those with two independent samples 
and those with matched pairs. For the case of independent samples, we developed 
the t test, based on an assumption of normality, as well as a modification of the t test 
that takes into account possibly unequal variances. The Mann-Whitney test, based on 
ranks, was presented as a nonparametric method, that is, a method that is not based 
on an assumption of a particular distribution. Similarly, for the matched-pairs design, 
we developed a parametric t test and a nonparametric test, the signed rank test. 

We discussed methods based on an assumption of normality and rank methods, 
which do not make this assumption. It turns out, rather surprisingly, that even if the 
normality assumption holds, the rank methods are quite powerful relative to the t test. 
Lehmann (1975) shows that the efficiency of the rank tests relative to that of the t 
test—that is, the ratio of sample sizes required to attain the same power—is typically 
around .95 if the distributions are normal. Thus, a rank test using a sample of size 
100 is as powerful as a t test based on 95 observations. Collecting the extra 5 pieces 
of data is a small price to pay for a safeguard against nonnormality. 

The bootstrap appeared again in this chapter. Indeed, this recently developed 
technique is finding applications in a great variety of statistical problems. 
In contrast with earlier chapters, where bootstrap samples were generated from one 
distribution, here we have bootstrapped from two empirical distributions. 

The chapter concluded with a discussion of experimental design, which empha- 
sized the importance of incorporating controls and randomization in investigations. 
Possible problems associated with observational studies were discussed. Finally, the 
difficulties encountered in making many comparisons from a single data set were 
pointed out; such problems of multiplicity will come up again in Chapter 12. 


11.6 Problems 


1. A computer was used to generate four random numbers from a normal distribution 
with a set mean and variance: 1.1650, .6268, .0751, .3516. Five more random 
normal numbers with the same variance but perhaps a different mean were then 
generated (the mean may or may not actually be different): .3035, 2.6961, 1.0591, 
2.7971, 1.2641. 


a. What do you think the means of the random normal number generators were? 
What do you think the difference of the means was? 

b. What do you think the variance of the random number generator was? 

c. What is the estimated standard error of your estimate of the difference of the
means?

d. Form a 90% confidence interval for the difference of the means of the random
number generators.

e. In this situation, is it more appropriate to use a one-sided test or a two-sided
test of the equality of the means?

f. What is the p-value of a two-sided test of the null hypothesis of equal means?

g. Would the hypothesis that the means were the same versus a two-sided alternative
be rejected at the significance level $\alpha = .1$?

h. Suppose you know that the variance of the normal distribution was $\sigma^2 = 1$.
How would your answers to the preceding questions change?


2. The difference of the means of two normal distributions with equal variance is to 
be estimated by sampling an equal number of observations from each distribution. 
If it were possible, would it be better to halve the standard deviations of the 
populations or double the sample sizes? 


3. In Section 11.2.1, we considered two methods of estimating
$\mathrm{Var}(\bar{X} - \bar{Y})$. Under the assumption that the two population
variances were equal, we estimated this quantity by

$$s_p^2\left(\frac{1}{n} + \frac{1}{m}\right)$$

and without this assumption by

$$\frac{s_X^2}{n} + \frac{s_Y^2}{m}$$

Show that these two estimates are identical if $m = n$.


4. Respond to the following: 


Using the t distribution is absolutely ridiculous, another example of
deliberate mystification! It's valid when the populations are normal and have
equal variance. If the sample sizes were so small that the t distribution were
practically different from the normal distribution, you would be unable to
check these assumptions.


5. Respond to the following: 


Here is another example of deliberate mystification—the idea of formulating 
and testing a null hypothesis. Let's take Example A of Section 11.2.1. It 
seems to me that it is inconceivable that the expected values of any two 
methods of measurement could be exactly equal. It is certain that there will 
be subtle differences at the very least. What is the sense, then, in testing 
$H_0: \mu_X = \mu_Y$?


6. Respond to the following: 


I have two batches of numbers and I have a corresponding $\bar{x}$ and $\bar{y}$. Why
should I test whether they are equal when I can just see whether they are or 
not? 


7. In the development of Section 11.2.1, where are the following assumptions used?
(1) $X_1, X_2, \ldots, X_n$ are independent random variables; (2) $Y_1, Y_2, \ldots, Y_m$
are independent random variables; (3) the X's and Y's are independent.


8. An experiment to determine the efficacy of a drug for reducing high blood pressure
is performed using four subjects in the following way: two of the subjects are
chosen at random for the control group and two for the treatment group. During
the course of treatment with the drug, the blood pressure of each of the subjects in
the treatment group is measured for ten consecutive days as is the blood pressure
of each of the subjects in the control group.

a. In order to test whether the treatment has an effect, do you think it is appropriate
to use the two-sample t test with n = m = 20?

b. Do you think it is appropriate to use the Mann-Whitney test with n = m = 20?


9. Referring to the data in Section 11.2.1.1, compare iron retention at concentrations
of 10.2 and .3 millimolar using graphical procedures and parametric and
nonparametric tests. Write a brief summary of your conclusions.


10. Verify that the two-sample t test at level $\alpha$ of $H_0: \mu_X = \mu_Y$ versus
$H_A: \mu_X \neq \mu_Y$ rejects if and only if the confidence interval for
$\mu_X - \mu_Y$ does not contain zero.


11. Explain how to modify the t test of Section 11.2.1 to test
$H_0: \mu_X = \mu_Y + \Delta$ versus $H_A: \mu_X \neq \mu_Y + \Delta$, where
$\Delta$ is specified.


12. An equivalence between hypothesis tests and confidence intervals was demonstrated
in Chapter 9. In Chapter 10, a nonparametric confidence interval for the
median, $\eta$, was derived. Explain how to use this confidence interval to test the
hypothesis $H_0: \eta = \eta_0$. In the case where $\eta_0 = 0$, show that using this
approach on a sample of differences from a paired experiment is equivalent to the sign
test. The sign test counts the number of positive differences and uses the fact
that in the case that the null hypothesis is true, the distribution of the number of
positive differences is binomial with (n, .5). Apply the sign test to the data from
the measurement of mercury levels, listed in Section 11.3.3.


13. Let $X_1, \ldots, X_{25}$ be i.i.d. $N(.3, 1)$. Consider testing the null hypothesis
$H_0: \mu = 0$ versus $H_A: \mu > 0$ at significance level $\alpha = .05$. Compare the
power of the sign test and the power of the test based on normal theory assuming
that $\sigma$ is known.


14. Suppose that $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$. To test the null
hypothesis $H_0: \mu = \mu_0$, the t test is often used:

$$t = \frac{\bar{X} - \mu_0}{s_{\bar{X}}}$$

Under $H_0$, t follows a t distribution with $n - 1$ df. Show that the likelihood ratio
test of this $H_0$ is equivalent to the t test.


15. Suppose that n measurements are to be taken under a treatment condition and
another n measurements are to be taken independently under a control condition.
It is thought that the standard deviation of a single observation is about 10
under both conditions. How large should n be so that a 95% confidence interval
for $\mu_X - \mu_Y$ has a width of 2? Use the normal distribution rather than the t
distribution, since n will turn out to be rather large.


16. Referring to Problem 15, how large should n be so that the test of
$H_0: \mu_X = \mu_Y$ against the one-sided alternative $H_A: \mu_X > \mu_Y$ has a
power of .5 if $\mu_X - \mu_Y = 2$ and $\alpha = .10$?
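
Calculations of the kind asked for in Problems 15 and 16 rest on the normal-theory standard error $\sigma\sqrt{2/n}$ of $\bar{X} - \bar{Y}$. The sketch below shows how interval width and one-sided power can be computed for given settings. The function names and the numerical settings are illustrative assumptions and deliberately differ from those in the problems.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1

def ci_half_width(sigma, n, level=0.95):
    """Half-width of the normal-theory interval for mu_X - mu_Y, n per group."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2.0)
    return z * sigma * sqrt(2.0 / n)

def one_sided_power(delta, sigma, n, alpha):
    """Power of the level-alpha z test of H0: mu_X = mu_Y vs HA: mu_X > mu_Y
    when the true difference is delta and each group has n observations."""
    se = sigma * sqrt(2.0 / n)                  # standard error of Xbar - Ybar
    critical = STD_NORMAL.inv_cdf(1.0 - alpha)  # upper-alpha normal quantile
    return 1.0 - STD_NORMAL.cdf(critical - delta / se)

# Hypothetical settings, chosen only for illustration.
for n in (25, 50, 100):
    print(n,
          round(ci_half_width(4.0, n), 3),
          round(one_sided_power(1.0, 4.0, n, 0.05), 3))
```

Increasing n shrinks the standard error, so the interval narrows and the power rises; at delta = 0 the power reduces to the significance level.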
17. Consider conducting a two-sided test of the null hypothesis $H_0: \mu_X = \mu_Y$ as
described in Problem 16. Sketch power curves for (a) $\alpha = .05$, n = 20;
(b) $\alpha = .10$, n = 20; (c) $\alpha = .05$, n = 40; (d) $\alpha = .10$, n = 40.
Compare the curves.


18. Two independent samples are to be compared to see if there is a difference in the
population means. If a total of m subjects are available for the experiment, how
should this total be allocated between the two samples in order to (a) provide the
shortest confidence interval for $\mu_X - \mu_Y$ and (b) make the test of
$H_0: \mu_X = \mu_Y$ as powerful as possible? Assume that the observations in the two
samples are normally distributed with the same variance.


19. An experiment is planned to compare the mean of a control group to the mean
of an independent sample of a group given a treatment. Suppose that there are to
be 25 samples in each group. Suppose that the observations are approximately
normally distributed and that the standard deviation of a single measurement in
either group is $\sigma = 5$.

a. What will the standard error of $\bar{Y} - \bar{X}$ be?

b. With a significance level $\alpha = .05$, what is the rejection region of the test of
the null hypothesis $H_0: \mu_Y = \mu_X$ versus the alternative $H_A: \mu_Y > \mu_X$?

c. What is the power of the test if $\mu_Y = \mu_X + 1$?

d. Suppose that the p-value of the test turns out to be 0.07. Would the test reject
at significance level $\alpha = .10$?

e. What is the rejection region if the alternative is $H_A: \mu_Y \neq \mu_X$? What is the
power if $\mu_Y = \mu_X + 1$?


20. Consider Example A of Section 11.3.1 using a Bayesian model. As in the example,
use a normal model for the differences and also use an improper prior
for the expected difference and the precision (as in the case of unknown mean
and variance in Section 8.6). Find the posterior probability that the expected
difference is positive. Find a 90% posterior credibility interval for the expected
difference.


21. A study was done to compare the performances of engine bearings made
of different compounds (McCool 1979). Ten bearings of each type were tested.
The following table gives the times until failure (in units of millions of
cycles):

Type I    Type II
 3.03      3.19
 5.53      4.26
 5.60      4.47
 9.30      4.53
 9.92      4.67
12.51      4.69
12.95     12.78
15.21      6.79
16.04      9.37
16.84     12.75





a. Use normal theory to test the hypothesis that there is no difference between
the two types of bearings.

b. Test the same hypothesis using a nonparametric method.

c. Which of the methods, that of part (a) or that of part (b), do you think is
better in this case?

d. Estimate $\pi$, the probability that a type I bearing will outlast a type II bearing.

e. Use the bootstrap to estimate the sampling distribution of $\hat{\pi}$ and its standard
error.

f. Use the bootstrap to find an approximate 90% confidence interval for $\pi$.


22. An experiment was done to compare two methods of measuring the calcium
content of animal feeds. The standard method uses calcium oxalate precipitation
followed by titration and is quite time-consuming. A new method using flame
photometry is faster. Measurements of the percent calcium content made by each
method of 118 routine feed samples (Heckman 1960) are contained in the file
calcium. Analyze the data to see if there is any systematic difference between
the two methods. Use both parametric and nonparametric tests and graphical
methods.


23. Let $X_1, \ldots, X_n$ be i.i.d. with cdf F, and let $Y_1, \ldots, Y_m$ be i.i.d. with
cdf G. The hypothesis to be tested is that F = G. Suppose for simplicity that m + n is
even so that in the combined sample of X's and Y's, (m + n)/2 observations are less
than the median and (m + n)/2 are greater.

a. As a test statistic, consider T, the number of X's less than the median of the
combined sample. Show that T follows a hypergeometric distribution under
the null hypothesis:

$$P(T = t) = \frac{\binom{(m+n)/2}{t}\binom{(m+n)/2}{n-t}}{\binom{m+n}{n}}$$

Explain how to form a rejection region for this test.

b. Show how to find a confidence interval for the difference between the median
of F and the median of G under the shift model, $G(x) = F(x - \Delta)$. (Hint:
Use the order statistics.)

c. Apply the results (a) and (b) to the data of Problem 21.


24. Find the exact null distribution of the Mann-Whitney statistic, $U_Y$, in the case
where m = 3 and n = 2.


25. Referring to Example A in Section 11.2.1, (a) if the smallest observation for
method B (79.94) is made arbitrarily small, will the t test still reject? (b) If the
largest observation for method B (80.03) is made arbitrarily large, will the t test
still reject? (c) Answer the same questions for the Mann-Whitney test.


26. Let $X_1, \ldots, X_n$ be a sample from an $N(0, 1)$ distribution and let
$Y_1, \ldots, Y_n$ be an independent sample from an $N(1, 1)$ distribution.

a. Determine the expected rank sum of the X's.

b. Determine the variance of the rank sum of the X's.


27. Find the exact null distribution of $W_+$ in the case where n = 4.


28. For n = 10, 20, and 30, find the .05 and .01 critical values for a two-sided signed
rank test from the tables and then by using the normal approximation. Compare
the values.


29. (Permutation Test for Means) Here is another view on hypothesis testing that
we will illustrate with Example A of Section 11.2.1. We ask whether the measurements
produced by methods A and B are identical or exchangeable in the
following sense. There are 13 + 8 = 21 measurements in all and there are
$\binom{21}{8}$, or about $2 \times 10^5$, ways that 8 of these could be assigned to
method B. Is the particular assignment we have observed unusual among these in the
sense that the means of the two samples are unusually different?

a. It's not inconceivable, but it may be asking too much for you to generate all
$\binom{21}{8}$ partitions. So just choose a random sample of these partitions, say of
size 1000, and make a histogram of the resulting values of $\bar{X}_A - \bar{X}_B$. Where
on this distribution does the value of $\bar{X}_A - \bar{X}_B$ that was actually observed
fall? Compare to the result of Example B of Section 11.2.1.

b. In what way is this procedure similar to the Mann-Whitney test?


30. Use the bootstrap to estimate the standard error of and a confidence interval for
$\bar{X}_A - \bar{X}_B$ and compare to the result of Example A of Section 11.2.1.


31. In Section 11.2.3, if F = G, what are $E(\hat{\pi})$ and $\mathrm{Var}(\hat{\pi})$? Would
there be any advantage in using equal sample sizes m = n in estimating $\pi$ or does it
make no difference?


32. If $X \sim N(\mu_X, \sigma_X^2)$ and Y is independent $N(\mu_Y, \sigma_Y^2)$, what is
$\pi = P(X < Y)$ in terms of $\mu_X$, $\mu_Y$, $\sigma_X$, and $\sigma_Y$?


33. To compare two variances in the normal case, let $X_1, \ldots, X_n$ be i.i.d.
$N(\mu_X, \sigma_X^2)$, and let $Y_1, \ldots, Y_m$ be i.i.d. $N(\mu_Y, \sigma_Y^2)$, where the
X's and Y's are independent samples. Argue that under $H_0: \sigma_X = \sigma_Y$,

$$\frac{s_X^2}{s_Y^2} \sim F_{n-1,\,m-1}$$

a. Construct rejection regions for one- and two-sided tests of $H_0$.

b. Construct a confidence interval for the ratio $\sigma_X^2 / \sigma_Y^2$.

c. Apply the results of parts (a) and (b) to Example A in Section 11.2.1. (Caution:
This test and confidence interval are not robust against violations of the
assumption of normality.)


34. This problem contrasts the power functions of paired and unpaired designs. Graph
and compare the power curves for testing $H_0: \mu_X = \mu_Y$ for the following two
designs.

a. Paired: $\mathrm{Cov}(X_i, Y_i) = 50$, $\sigma_X = \sigma_Y = 10$, $i = 1, \ldots, 25$.

b. Unpaired: $X_1, \ldots, X_{25}$ and $Y_1, \ldots, Y_{25}$ are independent with variance as in
part (a).
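
The random-partition procedure described in Problem 29 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `permutation_diffs` is hypothetical, and the two small samples are invented rather than taken from Example A of Section 11.2.1.

```python
import random
import statistics

def permutation_diffs(a, b, n_perm=1000, seed=0):
    """Differences of means over random reassignments of the pooled data."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # one random partition
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diffs.append(statistics.fmean(new_a) - statistics.fmean(new_b))
    return diffs

# Hypothetical measurements, for illustration only.
a = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4]
b = [4.1, 4.3, 3.9, 4.2]

observed = statistics.fmean(a) - statistics.fmean(b)
diffs = permutation_diffs(a, b)

# Fraction of random partitions at least as extreme as the observed one.
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
print("observed difference:", round(observed, 3))
print("approximate two-sided permutation p-value:", p_value)
```

A histogram of `diffs` plays the role of the reference distribution; the position of `observed` within it indicates how unusual the actual assignment is.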





35. An experiment was done to measure the effects of ozone, a component of smog.
A group of 22 seventy-day-old rats were kept in an environment containing ozone
for 7 days, and their weight gains were recorded. Another group of 23 rats of a
similar age were kept in an ozone-free environment for a similar time, and their
weight gains were recorded. The data (in grams) are given below. Analyze the
data to determine the effect of ozone. Write a summary of your conclusions.
[This problem is from Doksum and Sievers (1976) who provide an interesting
analysis.]

       Controls       |         Ozone
 41.0   38.4   24.9   |   10.1     6.1   20.4
 25.9   21.9   18.3   |    7.3    14.3   15.5
 13.1   21.3   28.5   |   -9.9     6.8   28.2
-16.9   17.4   21.8   |   17.9   -12.9   14.0
 15.4   27.4   19.2   |    6.6    12.1   15.7
 22.4   17.7   26.0   |   39.9   -15.9   54.6
 29.4   21.4   22.7   |  -14.7    44.1   -9.0
 26.0   26.6          |   -9.0


36. Lin, Sutton, and Qurashi (1979) compared microbiological and hydroxylamine
methods for the analysis of ampicillin dosages. In one series of experiments, pairs
of tablets were analyzed by the two methods. The data in the following table give
the percentages of claimed amount of ampicillin found by the two methods in
several pairs of tablets. What are $\bar{X} - \bar{Y}$ and $s_{\bar{X} - \bar{Y}}$? If the
pairing had been erroneously ignored and it had been assumed that the two samples
were independent, what would have been the estimate of the standard deviation of
$\bar{X} - \bar{Y}$? Analyze the data to determine if there is a systematic difference
between the two methods.

Microbiological Method    Hydroxylamine Method
        97.2                     97.2
       105.8                     97.8
        99.5                     96.2
       100.0                    101.8
        93.8                     88.0
        79.2                     74.0
        72.0                     75.0
        72.0                     67.5
        69.5                     65.8
        20.5                     21.2
        95.2                     94.8
        90.8                     95.8
        96.2                     98.0
        96.2                     99.0
        91.0                    100.2


37. Stanley and Walton (1961) ran a controlled clinical trial to investigate the effect
of the drug stelazine on chronic schizophrenics. The trials were conducted on
chronic schizophrenics in two closed wards. In each of the wards, the patients were
divided into two groups matched for age, length of time in the hospital, and score
on a behavior rating sheet. One member of each pair was given stelazine, and the
other a placebo. Only the hospital pharmacist knew which member of each pair
received the actual drug. The following table gives the behavioral rating scores
for the patients at the beginning of the trial and after 3 mo. High scores are good.

Ward A
    Stelazine      |     Placebo
Before    After    |  Before    After
 2.3      3.1      |   2.4      2.0
 2.0      2.1      |   2.2      2.6
 1.9      2.45     |   2.1      2.0
 3.1      3.7      |   2.9      2.0
 233      2.54     |   2.3      2.4
 2.3      3.72     |   2.4      3.18
 2.8      4.54     |   2.7      3.0
 1.9      1.61     |   1.9      2.54
 1.1      1.63     |   1.3      1.72

Ward B
    Stelazine      |     Placebo
Before    After    |  Before    After
 1.9      1.45     |   1.9      1.91
 2.3      2.45     |   2.4      2.54
 2.0      1.81     |   2.0      1.45
 1.6      1.72     |   1.5      1.45
 1.6      1.63     |   1.5      1.54
 2.6      2.45     |   2.7      1.54
 1.7      2.18     |   1.7      1.54

a. For each of the wards, test whether stelazine is associated with improvement
in the patients' scores.

b. Test if there is any difference in improvement between the wards. [These data
are also presented in Lehmann (1975), who discusses methods of combining
the data from the wards.]


38. Bailey, Cox, and Springer (1978) used high-pressure liquid chromatography to
measure the amounts of various intermediates and by-products in food dyes. The
following table gives the percentages added and found for two substances in the
dye FD&C Yellow No. 5. Is there any evidence that the amounts found differ
systematically from the amounts added?







Sulfanilic Acid 


Pyrazolone-T 





Percentage Added Percentage Found 


Percentage Added Percentage Found 





.048 
.096 
20 
.19 
.096 
.18 
.080 
24 


.060 
.091 
.16 
.16 
.091 
.19 
.070 
23 
0 
042 
.056 





.035 
.087 


.19 
.19 
.16 
.032 
.060 
.13 
.080 


.031 
.084 
.16 
A7 
AS 
.040 
.076 
ll 
.082 


39. An experiment was done to test a method for reducing faults on telephone lines 
(Welch 1987). Fourteen matched pairs of areas were used. The following table 
shows the fault rates for the control areas and for the test areas: 


Test    Control
676        88
206       570
230       605
256       617
280       653
433      2913
337       924
466       286
497      1098
512       982
794      2346
428       321
452       615
512       519

a. Plot the differences versus the control rate and summarize what you see.

b. Calculate the mean difference, its standard deviation, and a confidence interval.

c. Calculate the median difference and a confidence interval and compare to the
previous result.

d. Do you think it is more appropriate to use a t test or a nonparametric method to
test whether the apparent difference between test and control could be due to
chance? Why? Carry out both tests and compare.


40. Biological effects of magnetic fields are a matter of current concern and research.
In an early study of the effects of a strong magnetic field on the development of
mice (Barnothy 1964), 10 cages, each containing three 30-day-old albino female
mice, were subjected for a period of 12 days to a field with an average strength
of 80 Oe/cm. Thirty other mice housed in 10 similar cages were not placed in
a magnetic field and served as controls. The following table shows the weight
gains, in grams, for each of the cages.

Field Present    Field Absent
    22.8            23.5
    10.2            31.0
    20.8            19.5
    21.0            26.2
    19.2            26.5
     9.0            25.2
    14.2            24.5
    19.8            23.8
    14.5            27.8
    14.8            22.0

a. Display the data graphically with parallel dotplots. (Draw two parallel number
lines and put dots on one corresponding to the weight gains of the controls
and on the other at points corresponding to the gains of the treatment
group.)

b. Find a 95% confidence interval for the difference of the mean weight
gains.

c. Use a t test to assess the statistical significance of the observed difference.
What is the p-value of the test?

d. Repeat using a nonparametric test.

e. What is the difference of the median weight gains?

f. Use the bootstrap to estimate the standard error of the difference of median
weight gains.

g. Form a confidence interval for the difference of median weight gains based
on the bootstrap approximation to the sampling distribution.





41. The Hodges-Lehmann shift estimate is defined to be
$\hat{\Delta} = \mathrm{median}(X_i - Y_j)$, where $X_1, X_2, \ldots, X_n$ are independent
observations from a distribution F and $Y_1, Y_2, \ldots, Y_m$ are independent
observations from a distribution G and are independent of the $X_i$.

a. Show that if F and G are normal distributions, then
$E(\hat{\Delta}) = \mu_X - \mu_Y$.

b. Why is $\hat{\Delta}$ robust to outliers?

c. What is $\hat{\Delta}$ for the previous problem and how does it compare to the
differences of the means and of the medians?

d. Use the bootstrap to approximate the sampling distribution and the standard
error of $\hat{\Delta}$.

e. From the bootstrap approximation to the sampling distribution, form an
approximate 90% confidence interval for $\hat{\Delta}$.
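
The definition in Problem 41 translates directly into a few lines of code: form all $nm$ pairwise differences and take their median. The sketch below is only an illustration; the function name `hodges_lehmann` and the toy samples are assumptions, not data from the text.

```python
import statistics

def hodges_lehmann(x, y):
    """Median of all pairwise differences x_i - y_j."""
    return statistics.median(xi - yj for xi in x for yj in y)

# Hypothetical toy samples, for illustration only.
x = [1.0, 2.0, 3.0]
y = [0.0, 0.0, 0.0]
x_with_outlier = [1.0, 2.0, 300.0]

print(hodges_lehmann(x, y))               # 2.0
print(hodges_lehmann(x_with_outlier, y))  # 2.0
print(statistics.fmean(x_with_outlier) - statistics.fmean(y))  # 101.0
```

In this toy case the shift estimate is unchanged when one observation is moved far out, whereas the difference of means jumps from 2 to 101.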


42. Use the data of Problem 40 of Chapter 10.

a. Estimate $\pi$, the probability that more rain will fall from a randomly selected
seeded cloud than from a randomly selected unseeded cloud.

b. Use the bootstrap to estimate the standard error of $\hat{\pi}$.

c. Use the bootstrap to form an approximate confidence interval for $\pi$.


43. Suppose that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are two independent
samples. As a measure of the difference in location of the two samples, the difference
of the 20% trimmed means is used. Explain how the bootstrap could be used to
estimate the standard error of this difference.


44. Interest in the role of vitamin C in mental illness in general and schizophrenia
in particular was spurred by a paper of Linus Pauling in 1968. This exercise 
takes its data from a study of plasma levels and urinary vitamin C excretion in 
schizophrenic patients (Suboticanec et al. 1986). Twenty schizophrenic patients 
and 15 controls with a diagnosis of neurosis of different origin who had been 
patients at the same hospital for a minimum of 2 months were selected for the 
study. Before the experiment, all the subjects were on the same basic hospital 
diet. A sample of 2 ml of venous blood for vitamin C determination was drawn 
from each subject before breakfast and after the subjects had emptied their blad- 
ders. Each subject was then given 1 g ascorbic acid dissolved in water. No foods 
containing ascorbic acid were available during the test. For the next 6 h all urine 
was collected from the subjects for assay of vitamin C. A second blood sample 
was also drawn 2 h after the dose of vitamin C. 
The following two tables show the plasma concentrations (mg/dl). 











Schizophrenics    Nonschizophrenics
0 h    2 h        0 h    2 h
55, 1.22 1.27 2.00 
.60 1.54 .09 Al 
21 OT. 1.64 2.37 
.09 .45 23 Al 
1.01 1.54 .18 79 
24 415 12 94 
ot 1.12 85 1.72 
1.01 1.31 .69 1.75 
26 .92 78 1.60 
.30 1.27 .63 1.80 
26 1.08 .50 2.08 
.10 1.19 .62 1.58 
42 .64 .19 .86 
ll 30 .66 1.92 
14 24 91 1.54 
.20 89 
.09 24 
32 1.68 
.24 99 
25, .67 
a. Graphically compare the two groups at the two times and for the difference
in concentration at the two times.

b. Use the t test to assess the strength of the evidence for differences between
the two groups at 0 h, at 2 h, and the difference 2 h - 0 h.

c. Use the Mann-Whitney test to test the hypotheses of (b).

The following tables show the amounts of urinary vitamin C, both total
and milligrams per kilogram of body weight, for the two groups:











Schizophrenics Nonschizophrenics 
Total mg/kg | Total mg/kg 
16.6 .19 289.4 3.96 
33.3 44 0.0 0.00 
34.1 39 620.4 7.95 
0.0 .00 0.0 0.00 
119.8 1.75 8.5 .10 
wl .01 5.5 .09 
25.3 27 43.2 91 
359.3 5.99 91.7 1.00 
6.6 .10 200.9 3.46 
4 .01 113.8 2.01 
62.8 .68 102.2 1.50 
2 .01 108.2 1.98 
13.0 AS 36.9 49 
0.0 0.00 122.0 1.72 
0.0 0.00 101.9 1:52 
5.9 .10 
al .01 
6.0 .07 
32.1 .42 
0.0 0.00 


d. Use descriptive statistics and graphical presentations to compare the two
groups with respect to total excretion and mg/kg body weight. Do the data
look normally distributed?

e. Use a t test to compare the two groups on both variables. Is the normality
assumption reasonable?

f. Use the Mann-Whitney test to compare the two groups. How do the results
compare with those obtained in part (e)?


The lower levels of plasma vitamin C in the schizophrenics before administration
of ascorbic acid could be attributed to several factors. Interindividual
differences in the intake of meals cannot be excluded, despite the fact that
all patients were offered the same food. A more interesting possibility is
that the differences are the result of poorer resorption or of higher ascorbic
acid utilization in schizophrenics. In order to answer this question, another
experiment was run on 15 schizophrenics and 15 controls. All subjects were
given 70 mg of ascorbic acid daily for 4 weeks before the ascorbic acid loading
test. The following table shows the concentration of plasma vitamin C
(mg/dl) and the 6-h urinary excretion (mg) after administration of 1 g ascorbic
acid.














Schizophrenics Controls 
Plasma Urine Plasma Urine 
12 86.20 1.02 190.14 
dl 21.55 86 149.76 
.96 182.07 78 285.27 
1.23 88.28 1.38 244.93 
-16 76.58 95 184.45 
75 18.81 1.00 135.34 
1.26 50.02 47 157.74 
64 107.74 .60 125.65 
.67 .09 1.15 164.98 
1.05 113.23 .86 99.65 
1.28 34.38 61 86.29 
54 8.44 1.01 142.23 
TI 109.03 a 144.60 
1.11 144.44 Tq 265.40 
d 172.09 94 28.26 


g. Use graphical methods and descriptive statistics to compare the two groups 
with respect to plasma concentrations and urinary excretion. 

h. Use the t test to compare the two groups on the two variables. Does the
normality assumption look reasonable?

i. Compare the two groups using the Mann-Whitney test. 


45. This and the next two problems are based on discussions and data in Le Cam 
and Neyman (1967), which is devoted to the analysis of weather modification 
experiments. The examples illustrate some ways in which principles of experi- 
mental design have been used in this field. During the summers of 1957 through 
1960, a series of randomized cloud-seeding experiments were carried out in the 
mountains of Arizona. Of each pair of successive days, one day was randomly 
selected for seeding to be done. The seeding was done during a two-hour to 
four-hour period starting at midday, and rainfall during the afternoon was mea- 
sured by a network of 29 gauges. The data for the four years are given in the 
following table (in inches). Observations in this table are listed in chronological 
order. 


a. Analyze the data for each year and for the years pooled together to see if there
appears to be any effect due to seeding. You should use graphical descriptive
methods to get a qualitative impression of the results and hypothesis tests to
assess the significance of the results.

b. Why should the day on which seeding is to be done be chosen at random rather
than just alternating seeded and unseeded days? Why should the days be paired
at all, rather than just deciding randomly which days to seed?





       1957        |       1958        |       1959        |       1960
Seeded  Unseeded   | Seeded  Unseeded  | Seeded  Unseeded  | Seeded  Unseeded
 0       .154      |  .152    .013     |  .015    0        |  0       .010
 .154    0         |  0       0        |  0       0        |  0       0
 .003    .008      |  0       .445     |  0       .086     |  .042    .057
 .084    .033      |  .002    0        |  .021    .006     |  0       0
 .002    .035      |  .007    .079     |  0       .115     |  0       .093
 .157    .007      |  .013    .006     |  .004    .090     |  0       .183
 .010    .140      |  .161    .008     |  .010    0        |  .152    0
 0       .022      |  0       .001     |  0       0        |  0       0
 .002    0         |  .274    .001     |  .055    0        |  0       0
 .078    .074      |  .001    .025     |  .004    .076     |  0       0
 .101    .002      |  .122    .046     |  .053    .090     |  0       0
 .169    .318      |  .101    .007     |  0       0        |  0       0
 .139    .096      |  .012    .019     |  0       .078     |  .008    0
 .172    0         |  .002    0        |  .090    .121     |  .040    .060
 0       0         |  .066    0        |  .028    1.027    |  .003    .102
 0       .050      |  .040    .012     |  0       .104     |  .011    .041
 .032    .023      |                   |                   |
 .133    .172      |                   |                   |
 .083    .002      |                   |                   |
 0       0         |                   |                   |











46. The National Weather Bureau's ACN cloud-seeding project was carried out in
the states of Oregon and Washington. Cloud seeding was accomplished by dispersing
dry ice from an aircraft; only clouds that were deemed "ripe" for seeding
were candidates for seeding. On each occasion, a decision was made at random
whether to seed, the probability of seeding being $\frac{2}{3}$. This resulted in 22
seeded and 13 control cases. Three types of targets were considered, two of which are
dealt with in this problem. Type I targets were large geographical areas downwind from
the seeding; type II targets were sections of type I targets located so as to have,
theoretically, the greatest sensitivity to cloud seeding. The following table gives
the average target rainfalls (in inches) for the seeded and control cases, listed in
chronological order. Is there evidence that seeding has an effect on either type of
target? In what ways is the design of this experiment different from that of the
one in Problem 45?

  Control Cases     |    Seeded Cases
Type I    Type II   |  Type I    Type II
.0080     .0000     |  .1218     .0200
.0046     .0000     |  .0403     .0163
.0549     .0053     |  .1166     .1560
.1313     .0920     |  .2375     .2885
.0587     .0220     |  .1256     .1483
.1723     .1133     |  .1400     .1019
.3812     .2880     |  .2439     .1867
.1720     .0000     |  .0072     .0233
.1182     .1058     |  .0707     .1067
.1383     .2050     |  .1036     .1011
.0106     .0100     |  .1632     .2407
.2126     .2450     |  .0788     .0666
.1435     .1529     |  .0365     .0133
                    |  .2409     .2897
                    |  .0408     .0425
                    |  .2204     .2191
                    |  .1847     .0789
                    |  .3332     .3570
                    |  .0676     .0760
                    |  .1097     .0913
                    |  .0952     .0400
                    |  .2095     .1467





47. During 1963 and 1964, an experiment was carried out in France; its design dif- 
fered somewhat from those of the previous two problems. A 1500-km target area 
was selected, and an adjacent area of about the same size was designated as the 
control area; 33 ground generators were used to produce silver iodide to seed 
the target area. Precipitation was measured by a network of gauges for each suit- 
able “rainy period,” which was defined as a sequence of periods of continuous 
precipitation between dry spells of a specified length. When a forecaster deter- 
mined that the situation was favorable for seeding, he telephoned an order to 
a service agent, who then opened a sealed envelope that contained an order to 
actually seed or not. The envelopes had been prepared in advance, using a table 
of random numbers. The following table gives precipitation (in inches) in the 
target and control areas for the seeded and unseeded periods. 


a. Analyze the data, which are listed in chronological order, to see if there is an 
effect of seeding. 

b. The analysis done by the French investigators used the square root transfor- 
mation in order to make normal theory more applicable. Do you think that 
taking the square root was an effective transformation for this purpose? 

c. Reflect on the nature of this design. In particular, what advantage is there to
using the control area? Why not just compare seeded and unseeded periods
on the target area?

     Seeded        |     Unseeded
Target   Control   |  Target   Control
 1.6      1.0      |   1.1      2.2
28.1     27.0      |   3.5      52
 7.8      3        |   2.6      0.0
 4.0      6.0      |   2.6      2.0
 9.6     12.6      |   9.8      4.9
 0.2      0.5      |   5.6      8.5
18.7      8.7      |   wl       3.5
16.5     21.5      |   0.0      1.1
 4.6     13.9      |  17.7     11.0
 9.3      6.7      |  19.4     19.8
 3.5      4.5      |   8.9      5.3
 0.1      0.7      |  10.6      8.9
11.5      8.7      |  10.2      4.5
 0.0      0.0      |  16.0     13.0
 9.3     10.7      |   9.7     21.1
 5.5      4.7      |  21.4     15.9
70.2     29.1      |   6.1     19.5
 0.7      1.9      |  24.3     16.3
38.6     34.7      |  20.9      6.3
11.3     10.2      |  60.2     47.0
 3.3      2.7      |  15.2     10.8
 8.9      2.8      |   2T       4.8
11.1      4.3      |   0.3      0.0
64.3     38.7      |  12.2      5.7
16.6     11.1      |   22;      5.1
 7.3      6.5      |  23.3     30.6
 3.2      3.0      |   9.9      3.7
23.9     13.6      |
 0.6      0.1      |


48. Proteinuria, the presence of excess protein in urine, is a symptom of renal (kidney) 
distress among diabetics. Taguma et al. (1985) studied the effects of captopril for 
treating proteinuria in diabetics. Urinary protein was measured for 12 patients 
before and after eight weeks of captopril therapy. The amounts of urinary protein 
(in g/24 hrs) before and after therapy are shown in the following table. What 
can you conclude about the effect of captopril? Consider using parametric or 
nonparametric methods and analyzing the data on the original scale or on a log 
scale. 

Before   After
24.6     10.1
17.0     5.7
16.0     5.6
10.4     3.4
8.2      6.5
7.9      0.7
8.2      6.5
7.9      0.7
5.8      6.1
5.4      4.7
5.1      2.0
4.7      2.9


49. Egyptian researchers, Kamal et al. (1991), took a sample of 126 police officers 
subject to inhalation of vehicle exhaust in downtown Cairo and found an average 
blood level concentration of lead equal to 29.2 μg/dl with a standard deviation 
of 7.5 μg/dl. A sample of 50 policemen from a suburb, Abbasia, had an average 
concentration of 18.2 μg/dl and a standard deviation of 5.8 μg/dl. Form a confi- 
dence interval for the population difference and test the null hypothesis that there 
is no difference in the populations. 


50. The file bodytemp contains normal body temperature readings (degrees 
Fahrenheit) and heart rates (beats per minute) of 65 males (coded by 1) and 
65 females (coded by 2) from Shoemaker (1996). 


a. Using normal theory, form a 95% confidence interval for the difference of 
mean body temperatures between males and females. Is the use of the normal 
approximation reasonable? 

b. Using normal theory, form a 95% confidence interval for the difference of 
mean heart rates between males and females. Is the use of the normal approx- 
imation reasonable? 

c. Use both parametric and nonparametric tests to compare the body tempera- 
tures and heart rates. What do you conclude? 


51. A common symptom of otitis-media (inflammation of the middle ear) in young 
children is the prolonged presence of fluid in the middle ear, called middle-ear 
effusion. It is hypothesized that breast-fed babies tend to have less prolonged 
effusions than do bottle-fed babies. Rosner (2006) presents the results of a study 
of 24 pairs of infants who were matched according to sex, socioeconomic status, 
and type of medication taken. One member of each pair was bottle-fed and the 
other was breast-fed. The file ears gives the durations (in days) of middle-ear 
effusions after the first episode of otitis-media. 


a. Examine the data using graphical methods and summarize your conclusions. 

b. In order to test the hypothesis of no difference, do you think it is more appro- 
priate to use a parametric or a nonparametric test? Carry out a test. What do 
you conclude? 


52. The media often present short reports of the results of experiments. To the crit- 
ical reader or listener, such reports often raise more questions than they answer. 
Comment on possible pitfalls in the interpretation of each of the following. 


a. It is reported that patients whose hospital rooms have a window recover faster 
than those whose rooms do not. 

b. Nonsmoking wives whose husbands smoke have a cancer rate twice that of 
wives whose husbands do not smoke. 

c. A 2-year study in North Carolina found that 75% of all industrial accidents in 
the state happened to workers who had skipped breakfast. 

d. A school integration program involved busing children from minority schools 
to majority (primarily white) schools. Participation in the program was vol- 
untary. It was found that the students who were bused scored lower on stan- 
dardized tests than did their peers who chose not to be bused. 

e. When a group of students were asked to match pictures of newborns with 
pictures of their mothers, they were correct 36% of the time. 

f. A survey found that those who drank a moderate amount of beer were healthier 
than those who totally abstained from alcohol. 

g. A 15-year study of more than 45,000 Swedish soldiers revealed that heavy 
users of marijuana were six times more likely than nonusers to develop 
schizophrenia. 

h. A University of Wisconsin study showed that within 10 years of the wedding, 
38% of those who had lived together before marriage had split up, compared 
to 27% of those who had married without a "trial period." 

i. A study of nearly 4,000 elderly North Carolinians has found that those who 
attended religious services every week were 46% less likely to die over a 
six-year period than people who attended less often or not at all, according to 
researchers at Duke University Medical Center. 


53. Explain why in Levine's experiment (Example A in Section 11.3.1) subjects also 
smoked cigarettes made of lettuce leaves and unlit cigarettes. 


54. This example is taken from an interesting article by Joiner (1981) and from data in 
Ryan, Joiner, and Ryan (1976). The National Institute of Standards and Technol- 
ogy supplies standard materials of many varieties to manufacturers and other par- 
ties, who use these materials to calibrate their own testing equipment. Great pains 
are taken to make these reference materials as homogeneous as possible. In an ex- 
periment, a long homogeneous steel rod was cut into 4-inch lengths, 20 of which 
were randomly selected and tested for oxygen content. Two measurements were 
made on each piece. The 40 measurements were made over a period of 5 days, 
with eight measurements per day. In order to avoid possible bias from time-related 
trends, the sequence of measurements was randomized. The file steelrods 
contains the measurements. There is an unexpected systematic source of variabil- 
ity in these data. Can you find it by making an appropriate plot? Would this effect 
have been detectable if the measurements had not been randomized over time? 


CHAPTER 12 


The Analysis of Variance 


12.1 Introduction 


Chapter 11 was concerned with the analysis of data arising from experimental designs 
with two samples. Experiments frequently involve more than two samples; they may 
compare several treatments, such as different drugs, and perhaps other factors, such 
as sex, at the same time. This chapter is an introduction to the statistical analysis 
of such experiments. The methods we will discuss are called analysis of variance. 
Contrary to what this phrase seems to imply, we will be primarily concerned with the 
comparison of the means of the data, not their variances. We will consider the two 
most elementary multisample designs: the one-way and two-way layouts. Methods 
based on the normal distribution and nonparametric methods will be developed. 


12.2 The One-Way Layout 


A one-way layout is an experimental design in which independent measurements 
are made under each of several treatments. The techniques we will introduce are thus 
generalizations of the techniques for comparing two independent samples that were 
covered in Chapter 11. 

In this section, we will use as an example data from Kirchhoefer (1979), who 
studied the measurement of chlorpheniramine maleate in tablets. Measurements of 
composites that had nominal dosages equal to 4 mg were made by seven laboratories, 
each laboratory making 10 measurements. The data are shown in the following table. There 
are two possible sources of variability in the data: variability within labs and variability 
between labs. 

Lab 1 Lab 2 Lab 3 Lab 4 Lab 5 Lab 6 Lab 7 
4.13 3.86 4.00 3.88 4.02 4.02 4.00 
4.07 3.85 4.02 3.88 3.95 3.86 4.02 
4.04 4.08 4.01 3.91 4.02 3.96 4.03 
4.07 4.11 4.01 3.95 3.89 3.97 4.04 
4.05 4.08 4.04 3.92 3.91 4.00 4.10 
4.04 4.01 3.99 3.97 4.01 3.82 3.81 
4.02 4.02 4.03 3.92 3.89 3.98 3.91 
4.06 4.04 3.97 3.90 3.89 3.99 3.96 
4.10 3.97 3.98 3.97 3.99 4.02 4.05 
4.04 3.95 3.98 3.90 4.00 3.93 4.06 





Figure 12.1, a boxplot of these data, shows some variation in the medians among 
the seven labs, as well as some variation in the interquartile ranges. It appears from 
the figure that there may be some systematic differences between the labs and that 
there is less variability in some labs than in others. We will discuss the following 
question: Are the differences in the means of the measurements from the various labs 
significant, or might they be due to chance? 

FIGURE 12.1 Boxplots of determinations of amounts of chlorpheniramine maleate 
in tablets by seven laboratories. 


12.2.1 Normal Theory; the F Test 


We first discuss the analysis of variance and the F test in the case of I groups, each 
containing J samples. The I groups will be referred to generically as treatments, or 
levels. (In the preceding example, I = 7 and J = 10. We will discuss the case of 
unequal sample sizes later.) 

We first define some notation and introduce the basic model. Let 

$Y_{ij}$ = the jth observation of the ith treatment 


Our model is that the observations are corrupted by random errors and that the error in 
one observation is independent of the errors in the other observations. The statistical 
model is 

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$

Here $\mu$ is the overall mean level, $\alpha_i$ is the differential effect of the ith treatment, and 
$\varepsilon_{ij}$ is the random error in the jth observation under the ith treatment. The errors are 
assumed to be independent, normally distributed with mean zero and variance $\sigma^2$. 
The $\alpha_i$ are normalized: 

$$\sum_{i=1}^{I} \alpha_i = 0$$

The expected response to the ith treatment is $E(Y_{ij}) = \mu + \alpha_i$. Thus, if $\alpha_i = 0$, for 
$i = 1, \ldots, I$, all treatments have the same expected response, and, in general, $\alpha_i - \alpha_j$ 
is the difference between the expected values under treatments i and j. We will derive 
a test for the null hypothesis, which is that all the means are equal. 


The analysis of variance is based on the following identity: 

$$\sum_{i=1}^{I}\sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{..})^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{i.})^2 + J\sum_{i=1}^{I} (\bar{Y}_{i.} - \bar{Y}_{..})^2$$

where 

$$\bar{Y}_{i.} = \frac{1}{J}\sum_{j=1}^{J} Y_{ij}$$

is the average of the observations under the ith treatment and 

$$\bar{Y}_{..} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} Y_{ij}$$

is the overall average. The terms appearing in the first identity above are called sums 
of squares, and the identity may be symbolically expressed as 

$$SS_{TOT} = SS_W + SS_B$$

In words, this means that the total sum of squares equals the sum of squares within 
groups plus the sum of squares between groups. The terminology reflects that $SS_W$ 
is a measure of the variation of the data within the treatment groups and that $SS_B$ is 
a measure of the variation of the treatment means among or between treatments. 

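To establish the identity, we express the left-hand side as 

$$\sum_{i=1}^{I}\sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{..})^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \left[(Y_{ij} - \bar{Y}_{i.}) + (\bar{Y}_{i.} - \bar{Y}_{..})\right]^2$$

$$= \sum_{i=1}^{I}\sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{i.})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar{Y}_{i.} - \bar{Y}_{..})^2 + 2\sum_{i=1}^{I}\sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{i.})(\bar{Y}_{i.} - \bar{Y}_{..})$$

The last term of the final expression vanishes because the sum of deviations from a 
mean is zero. 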
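
The identity can also be checked numerically. The following Python sketch uses a small made-up data set (the numbers and the helper name `sums_of_squares` are illustrative assumptions, not taken from the text) and confirms that the cross-product term is zero up to floating-point error, so that the total sum of squares equals the within-group plus the between-group sum of squares.

```python
# Hypothetical toy data: I = 3 treatments, J = 4 observations each.
groups = [
    [4.1, 3.9, 4.0, 4.2],
    [3.7, 3.8, 3.6, 3.9],
    [4.4, 4.3, 4.5, 4.2],
]


def mean(xs):
    return sum(xs) / len(xs)


def sums_of_squares(groups):
    """Return (SS_TOT, SS_W, SS_B, cross term) for a balanced one-way layout."""
    J = len(groups[0])
    grand = mean([y for g in groups for y in g])
    group_means = [mean(g) for g in groups]
    ss_tot = sum((y - grand) ** 2 for g in groups for y in g)
    ss_w = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    ss_b = J * sum((m - grand) ** 2 for m in group_means)
    # Cross-product term from expanding the square; it should vanish.
    cross = 2 * sum(
        (y - m) * (m - grand) for g, m in zip(groups, group_means) for y in g
    )
    return ss_tot, ss_w, ss_b, cross


ss_tot, ss_w, ss_b, cross = sums_of_squares(groups)
print(ss_tot, ss_w + ss_b, cross)
```

The first two printed numbers should agree and the third should be zero, apart from rounding in floating-point arithmetic.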
As we will see, the basic idea underlying analysis of variance is the comparison 
of the sizes of various sums of squares. We can calculate the expected values of the 
sums of squares defined previously using the following lemma. 


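LEMMA A 


Let $X_i$, where $i = 1, \ldots, n$, be independent random variables with $E(X_i) = \mu_i$ 
and $\mathrm{Var}(X_i) = \sigma^2$. Then 

$$E(X_i - \bar{X})^2 = (\mu_i - \bar{\mu})^2 + \frac{n-1}{n}\sigma^2$$

where 

$$\bar{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mu_i$$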

Proof 


We use the fact that $E(U^2) = [E(U)]^2 + \mathrm{Var}(U)$ for any random variable U with 
finite variance. The first term on the right-hand side of the equation in the lemma 
follows immediately. For the second term, we have to calculate $\mathrm{Var}(X_i - \bar{X})$: 

$$\mathrm{Var}(X_i - \bar{X}) = \mathrm{Var}(X_i) + \mathrm{Var}(\bar{X}) - 2\,\mathrm{Cov}(X_i, \bar{X})$$

and 

$$\mathrm{Var}(X_i) = \sigma^2$$

$$\mathrm{Var}(\bar{X}) = \frac{1}{n}\sigma^2$$

$$\mathrm{Cov}(X_i, \bar{X}) = \mathrm{Cov}\left(X_i, \frac{1}{n}\sum_{j=1}^{n} X_j\right) = \frac{1}{n}\sigma^2$$

(Here we have used $\mathrm{Cov}(X_i, X_j) = 0$ if $i \neq j$, since the X's are indepen- 
dent.) Putting these results together gives $\mathrm{Var}(X_i - \bar{X}) = \sigma^2 + \sigma^2/n - 2\sigma^2/n = (n-1)\sigma^2/n$, 
which proves the lemma. ■ 

Lemma A may be applied to the sums of squares discussed before, yielding the 
following theorem. 


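THEOREM A 


Under the assumptions for the model stated at the beginning of this section, 

$$E(SS_W) = I(J-1)\sigma^2$$

$$E(SS_B) = J\sum_{i=1}^{I} \alpha_i^2 + (I-1)\sigma^2$$


Proof 


$$E(SS_W) = \sum_{i=1}^{I}\sum_{j=1}^{J} E(Y_{ij} - \bar{Y}_{i.})^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{J-1}{J}\sigma^2 = I(J-1)\sigma^2$$

Here we have used Lemma A, with the role of $X_i$ being played by $Y_{ij}$ and that 
of $\bar{X}$ being played by $\bar{Y}_{i.}$. The second equality then follows since $E(Y_{ij}) = E(\bar{Y}_{i.}) = 
\mu + \alpha_i$. To find $E(SS_B)$, we again use the lemma with $\bar{Y}_{i.}$ and $\bar{Y}_{..}$ in place of $X_i$ 
and $\bar{X}$: 

$$E(SS_B) = J\sum_{i=1}^{I} E(\bar{Y}_{i.} - \bar{Y}_{..})^2 = J\sum_{i=1}^{I}\left[\alpha_i^2 + \frac{(I-1)\sigma^2}{IJ}\right] = J\sum_{i=1}^{I}\alpha_i^2 + (I-1)\sigma^2$$ ■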

$SS_W$ may be used to estimate $\sigma^2$; the estimate is 

$$s_p^2 = \frac{SS_W}{I(J-1)}$$

which is unbiased. The subscript p stands for pooled. Estimates of $\sigma^2$ from the I 
treatments are pooled together, since $SS_W$ can be written as 

$$SS_W = \sum_{i=1}^{I} (J-1)s_i^2$$

where $s_i^2$ is the sample variance in the ith group. 

If all the $\alpha_i$ are equal to zero, then the expectation of $SS_B/(I-1)$ is also $\sigma^2$. 
Thus, in this case, $SS_W/[I(J-1)]$ and $SS_B/(I-1)$ should be about equal. If some 
of the $\alpha_i$ are nonzero, $SS_B$ will be inflated. We next develop a method of comparing 
the two sums of squares to find a test statistic for testing the null hypothesis that all 
the $\alpha_i$ are equal. Under the assumption that the errors are normally distributed, the 
probability distributions of the sums of squares can be calculated. 


THEOREM B 


If the errors are independent and normally distributed with means 0 and variances 
$\sigma^2$, then $SS_W/\sigma^2$ follows a chi-square distribution with $I(J-1)$ degrees of 
freedom. If, additionally, the $\alpha_i$ are all equal to zero, then $SS_B/\sigma^2$ follows a chi- 
square distribution with $I-1$ degrees of freedom and is independent of $SS_W$. 


Proof 


We first consider $SS_W$. From Theorem B of Section 6.3, 

$$\frac{1}{\sigma^2}\sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{i.})^2$$

follows a chi-square distribution with $J-1$ degrees of freedom. There are I such 
sums in $SS_W$, and they are independent of each other since the observations are 
independent. The sum of I independent chi-square random variables that each 
have $J-1$ degrees of freedom follows a chi-square distribution with $I(J-1)$ 
degrees of freedom. Theorem B of Section 6.3 can also be applied to $SS_B$, noting 
that $\mathrm{Var}(\bar{Y}_{i.}) = \sigma^2/J$. 

We next prove that the two sums of squares are independent of each other. 
$SS_W$ is a function of the vector U, which has elements $Y_{ij} - \bar{Y}_{i.}$, where $i = 
1, \ldots, I$ and $j = 1, \ldots, J$. $SS_B$ is a function of the vector V, whose elements 
are $\bar{Y}_{i.}$, where $i = 1, \ldots, I$, since $\bar{Y}_{..}$ can be obtained from the $\bar{Y}_{i.}$. Thus, it is 
sufficient to show that these two vectors are independent of each other. First, if 
$i \neq i'$, $Y_{ij} - \bar{Y}_{i.}$ and $\bar{Y}_{i'.}$ are independent since they are functions of different 
observations. Second, $Y_{ij} - \bar{Y}_{i.}$ and $\bar{Y}_{i.}$ are independent by Theorem A of Section 
6.3. This completes the proof of the theorem. ■ 


The statistic 

$$F = \frac{SS_B/(I-1)}{SS_W/[I(J-1)]}$$

is used to test the following null hypothesis: 

$$H_0\colon \alpha_1 = \alpha_2 = \cdots = \alpha_I = 0$$

By Theorem A, the denominator of the F statistic has expected value equal to $\sigma^2$, 
and the expectation of the numerator is $J(I-1)^{-1}\sum_{i=1}^{I}\alpha_i^2 + \sigma^2$. Thus, if the null 
hypothesis is true, the F statistic should be close to 1, whereas if it is false, the statistic 
should be larger. If the null hypothesis is false, the numerator reflects variation between 
the different groups as well as variation within groups, whereas the denominator 
reflects only variation within groups. The hypothesis is thus rejected for large values 
of F. As usual, in order to apply this test, we must know the null distribution of the 
test statistic. 


THEOREM C 


Under the assumption that the errors are normally distributed, the null distribution 
of F is the F distribution with $(I-1)$ and $I(J-1)$ degrees of freedom. 


Proof 


The theorem follows from Theorem B and from the definition of the F distribution 
(Section 6.2), since, under $H_0$, F is the ratio of two independent chi-square 
random variables divided by their degrees of freedom. ■ 


Percentage points of the F distribution are widely tabled. It can be shown that, under 
the normality assumption, the F test is equivalent to the likelihood ratio test. 


EXAMPLE A 


We can illustrate the use of the F statistic by applying it to the tablet data from Section 
12.2. In doing so, we adopt an explicit statistical model for the variability seen in Figure 
12.1. According to this model, there is an unknown mean level associated with each 
laboratory, and the deviations from this mean level of the 10 measurements within a 
laboratory are independent, normally distributed, random variables. With the aid of 
this model, we will see whether it is plausible that the unknown laboratory means are 
all equal, so that the variability between labs displayed in Figure 12.1 is entirely due 
to chance. 

The sums of squares defined previously are calculated and presented in a table 
called the analysis of variance table: 


Source   df    SS      MS      F 
Labs      6    .125    .021    5.66 
Error    63    .231    .0037 
Total    69    .356 


In the table, $SS_W$ is the sum of squares due to error, and $SS_B$ is the sum of squares 
due to labs. MS stands for mean square and equals the sum of squares divided by 
the degrees of freedom. The column headed F gives the F statistic for testing the 
null hypothesis that there is no systematic difference among the seven labs. The F 
statistic has 6 and 63 df and a value of 5.66. This particular combination of degrees 
of freedom is not included in Table 5 of Appendix B, but upon examining the en- 
tries with 6 and 60 df, it is clear that the p-value is less than .01. We may thus 
conclude that the means of the measurements from the various labs are significantly 
different. 

Figure 12.2 is a normal probability plot of the residuals from the analysis of 
variance model (the residuals are formed by simply subtracting from the measure- 
ments of each lab the mean value for that lab). There is some indication of deviation 
from normality in the lower tail of the distribution, but the data do not appear grossly 
nonnormal. ■ 

FIGURE 12.2 Normal probability plot of residuals from one-way analysis of 
variance of tablet data. 
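
The entries of the analysis of variance table can be reproduced directly from the tablet data of Section 12.2. The following Python sketch is one way to do so using only the formulas for $SS_B$ and $SS_W$; the variable names are illustrative choices, and the p-value is not computed because that would require an F distribution routine that this sketch does not assume.

```python
# Tablet data from Section 12.2: 10 determinations from each of 7 labs.
labs = {
    1: [4.13, 4.07, 4.04, 4.07, 4.05, 4.04, 4.02, 4.06, 4.10, 4.04],
    2: [3.86, 3.85, 4.08, 4.11, 4.08, 4.01, 4.02, 4.04, 3.97, 3.95],
    3: [4.00, 4.02, 4.01, 4.01, 4.04, 3.99, 4.03, 3.97, 3.98, 3.98],
    4: [3.88, 3.88, 3.91, 3.95, 3.92, 3.97, 3.92, 3.90, 3.97, 3.90],
    5: [4.02, 3.95, 4.02, 3.89, 3.91, 4.01, 3.89, 3.89, 3.99, 4.00],
    6: [4.02, 3.86, 3.96, 3.97, 4.00, 3.82, 3.98, 3.99, 4.02, 3.93],
    7: [4.00, 4.02, 4.03, 4.04, 4.10, 3.81, 3.91, 3.96, 4.05, 4.06],
}

I = len(labs)        # number of treatments (labs)
J = len(labs[1])     # observations per lab (balanced design)
all_obs = [y for ys in labs.values() for y in ys]
grand_mean = sum(all_obs) / len(all_obs)
lab_means = {lab: sum(ys) / J for lab, ys in labs.items()}

# Between-lab and within-lab sums of squares.
ss_b = J * sum((m - grand_mean) ** 2 for m in lab_means.values())
ss_w = sum((y - lab_means[lab]) ** 2 for lab, ys in labs.items() for y in ys)

df_b, df_w = I - 1, I * (J - 1)
ms_b, ms_w = ss_b / df_b, ss_w / df_w
f_stat = ms_b / ms_w

print(f"Labs   df={df_b:2d}  SS={ss_b:.3f}  MS={ms_b:.4f}  F={f_stat:.2f}")
print(f"Error  df={df_w:2d}  SS={ss_w:.3f}  MS={ms_w:.4f}")
print(f"Total  df={df_b + df_w:2d}  SS={ss_b + ss_w:.3f}")
```

The printed sums of squares, mean squares, and F statistic should agree with the analysis of variance table to the number of digits shown there.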


We now outline the procedure for the case in which the numbers of observations 
under the various treatments are not necessarily equal. The only difficulties with this 
case are algebraic; conceptually, the analysis is the same as for the case of equal sample 
sizes. Suppose that there are $J_i$ observations under treatment i, for $i = 1, \ldots, I$. The 
basic identity still holds; that is, we have 

$$\sum_{i=1}^{I}\sum_{j=1}^{J_i} (Y_{ij} - \bar{Y}_{..})^2 = \sum_{i=1}^{I}\sum_{j=1}^{J_i} (Y_{ij} - \bar{Y}_{i.})^2 + \sum_{i=1}^{I} J_i(\bar{Y}_{i.} - \bar{Y}_{..})^2$$

By reasoning similar to that used here for the simple case, it can be shown that 

$$E(SS_W) = \sigma^2\sum_{i=1}^{I} (J_i - 1)$$

$$E(SS_B) = (I-1)\sigma^2 + \sum_{i=1}^{I} J_i\alpha_i^2$$

The degrees of freedom for these sums of squares are $\sum_{i=1}^{I} J_i - I$ and $I - 1$, 
respectively. It may be argued, as in the proof of Theorem B, that the normalized 
sums of squares follow chi-square distributions and that the ratio of mean squares 
follows an F distribution under the null hypothesis of no treatment differences. 
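
The unequal-sample-size formulas can be sketched in a few lines of Python. The function name `one_way_anova` and the toy data below are hypothetical, chosen only so that the arithmetic is easy to follow by hand; this is an illustration of the formulas above, not a replacement for a statistical package.

```python
def one_way_anova(groups):
    """Sums of squares, degrees of freedom, and F for groups of unequal size."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Within-group sum of squares: deviations from each group's own mean.
    ss_w = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    # Between-group sum of squares: each group mean is weighted by J_i.
    ss_b = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_tot = sum((y - grand) ** 2 for g in groups for y in g)
    df_b = len(groups) - 1
    df_w = n_total - len(groups)
    f_stat = (ss_b / df_b) / (ss_w / df_w)
    return {"ss_tot": ss_tot, "ss_w": ss_w, "ss_b": ss_b,
            "df_b": df_b, "df_w": df_w, "F": f_stat}


# Toy data with J_1 = 3, J_2 = 2, J_3 = 1.
result = one_way_anova([[1.0, 2.0, 3.0], [4.0, 6.0], [10.0]])
print(result)
```

For this toy data the group means are 2, 5, and 10, so $SS_W = 4$, $SS_B = 148/3$, the degrees of freedom are 2 and 3, and the F statistic is 18.5.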


To conclude this section, let us review the basic assumptions of the model and 
comment on their importance. The model is 

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$

We assume the following: 


1. The $\varepsilon_{ij}$ are normally distributed. The F test, like the t test, remains approximately 
valid for moderate to large samples from moderately nonnormal distributions. 

2. The error variance, $\sigma^2$, is constant. In many applications, the error variances may 
be different in different groups. For example, Figure 12.1 suggests that some 
labs may be more precise in their measurements than others. Fortunately, if there 
are an equal number of observations in each group, the F test is not strongly 
affected. 

3. The $\varepsilon_{ij}$ are independent. This assumption is very important, both for normal theory 
and for the nonparametric analysis we will present later. 


12.2.2 The Problem of Multiple Comparisons 


The application of the F test in Example A in Section 12.2.1 has an anticlimactic 
character. We concluded that the means of measurements from different labs are not 
all equal, but the test gives no information about how they differ, in particular about 
which pairs are significantly different. In many applications, the null hypothesis is 
a "straw man" that is not seriously entertained. Real interest may be focused on 
comparing pairs or groups of treatments and estimating the treatment means and 
their differences. A naive approach would be to compare all pairs of treatment means 
using t tests. The difficulty with such a procedure was pointed out in the section on 
experimental design in Chapter 11: Although each individual comparison would have 
a type I error rate of $\alpha$, the collection of all comparisons considered simultaneously 
would not. In this section, we discuss two solutions to this problem: Tukey's method 
and the Bonferroni method. More discussion can be found in Miller (1981). 


12.2.2.1 Tukey's Method  Tukey's method is used to construct confidence inter- 
vals for the differences of all pairs of means in such a way that the intervals simulta- 
neously have a set coverage probability. The duality of confidence intervals and tests 
can then be used to determine which particular pairs are significantly different. 

If the sample sizes are all equal and the errors are normally distributed with 
a constant variance, the centered sample means, $\bar{Y}_{i.} - \mu_i$ (where $\mu_i = \mu + \alpha_i$ is the 
mean of the ith treatment), are independent and 
normally distributed with means 0 and variances $\sigma^2/J$, which may be estimated by 
$s_p^2/J$. Tukey's method is based on the probability distribution of the random variable 

$$\max_{i_1, i_2} \frac{|(\bar{Y}_{i_1.} - \mu_{i_1}) - (\bar{Y}_{i_2.} - \mu_{i_2})|}{s_p/\sqrt{J}}$$

where the maximum is taken over all pairs $i_1$ and $i_2$. This distribution is called the 
studentized range distribution with parameters I (the number of samples being 
compared) and $I(J-1)$ (the degrees of freedom in $s_p$). The upper $100\alpha$ percentage 
point of the distribution is denoted by $q_{I,I(J-1)}(\alpha)$. Now, 

$$P\left[|(\bar{Y}_{i_1.} - \mu_{i_1}) - (\bar{Y}_{i_2.} - \mu_{i_2})| \le q_{I,I(J-1)}(\alpha)\frac{s_p}{\sqrt{J}} \text{ for all } i_1 \text{ and } i_2\right]$$

$$= P\left[\max_{i_1, i_2} |(\bar{Y}_{i_1.} - \mu_{i_1}) - (\bar{Y}_{i_2.} - \mu_{i_2})| \le q_{I,I(J-1)}(\alpha)\frac{s_p}{\sqrt{J}}\right]$$

By definition, this latter probability equals $1 - \alpha$. The idea is that all the differences 
are less than some number if and only if the largest difference is. The above proba- 
bility statement can be converted directly into a set of confidence intervals that hold 
simultaneously for all differences $\mu_{i_1} - \mu_{i_2}$ with confidence level $100(1 - \alpha)\%$. The 
intervals are 

$$(\bar{Y}_{i_1.} - \bar{Y}_{i_2.}) \pm q_{I,I(J-1)}(\alpha)\frac{s_p}{\sqrt{J}}$$

By the duality of confidence intervals and hypothesis tests, if the $100(1 - \alpha)\%$ 
confidence interval for $\mu_{i_1} - \mu_{i_2}$ does not include zero, that is, if 

$$|\bar{Y}_{i_1.} - \bar{Y}_{i_2.}| > q_{I,I(J-1)}(\alpha)\frac{s_p}{\sqrt{J}}$$

the null hypothesis that there is no difference between $\mu_{i_1}$ and $\mu_{i_2}$ may be rejected 
at level $\alpha$. Also, all such hypothesis tests considered collectively have level $\alpha$. 


EXAMPLE A 


We can illustrate Tukey's method by applying it to the tablet data of Section 12.2. We 
list the labs in decreasing order of the mean of their measurements: 


Lab   Mean 
1     4.062 
3     4.003 
7     3.998 
2     3.997 
5     3.957 
6     3.955 
4     3.920 


$s_p$ is the square root of the mean square for error in the analysis of variance table of 
Example A of Section 12.2.1: $s_p = .06$. The degrees of freedom of the appropriate 
studentized range distribution are 7 and 63. Using 7 and 60 df in Table 6 of Appendix 
B as an approximation, $q_{7,60}(.05) = 4.31$, two of the means in the preceding table are 
significantly different at the .05 level if they differ by more than 

$$q_{7,60}(.05)\frac{s_p}{\sqrt{J}} = .082$$

The mean from lab 1 is thus significantly different from those from labs 4, 5, and 6. 
The mean of lab 3 is significantly greater than that of lab 4. No other comparisons 
are significant at the .05 level. 
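
The comparisons just described can be tabulated mechanically. The Python sketch below takes the lab means, $s_p = .06$, $J = 10$, and the tabled value 4.31 as given in the text; the lab labels attached to the means are those obtained by averaging the columns of the tablet data in Section 12.2, and the variable names are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt

# Lab means for the tablet data, keyed by lab number.
lab_means = {1: 4.062, 3: 4.003, 7: 3.998, 2: 3.997, 5: 3.957, 6: 3.955, 4: 3.920}

s_p = 0.06   # pooled standard deviation, as rounded in the text
J = 10       # observations per lab
q = 4.31     # q_{7,60}(.05), used in the text to approximate q_{7,63}(.05)

# Two means differ significantly if they are more than this far apart.
threshold = q * s_p / sqrt(J)

significant = sorted(
    (a, b)
    for a, b in combinations(sorted(lab_means), 2)
    if abs(lab_means[a] - lab_means[b]) > threshold
)

print(round(threshold, 3))
print(significant)
```

The listed pairs are those whose means differ by more than the threshold: lab 1 versus labs 4, 5, and 6, and lab 3 versus lab 4.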


At the 95% confidence level, the other differences in mean level that are seen 
in Figure 12.1 cannot be judged to be significantly different from zero. Although 
differences between these labs must certainly exist, we cannot reliably establish the 
signs of the differences. 

It is interesting to note that a price is paid here for performing multiple compar- 
isons simultaneously. If separate t tests had been conducted using the pooled sample 
variance, labs would have been declared significantly different if their means had 
differed by more than 

$$t_{63}(.025)\,s_p\sqrt{\frac{2}{J}} = .053$$ ■ 


12.2.2.2 The Bonferroni Method  The Bonferroni method was briefly introduced 
in Section 11.4.8. The idea is very simple. If k null hypotheses are to be tested, a 
desired overall type I error rate of at most $\alpha$ can be guaranteed by testing each null 
hypothesis at level $\alpha/k$. Equivalently, if k confidence intervals are each formed to 
have confidence level $100(1 - \alpha/k)\%$, they all hold simultaneously with confidence 
level at least $100(1 - \alpha)\%$. 

The method is simple and versatile and, although crude, gives surprisingly good 
results if k is not too large. 


EXAMPLE A 


To apply the Bonferroni method to the data on tablets, we note that there are $k = 
\binom{7}{2} = 21$ pairwise comparisons among the seven labs. A set of simultaneous 95% 
confidence intervals for the pairwise comparisons is 

$$(\bar{Y}_{i_1.} - \bar{Y}_{i_2.}) \pm \frac{s_p\,t_{63}(.025/21)}{\sqrt{5}}$$

Special tables for such values of the t distribution have been prepared; from Table 7 
of Appendix B, we find 

$$t_{60}(.025/21) = 3.16$$

which we will use as an approximation to $t_{63}(.025/21)$, giving confidence intervals 

$$(\bar{Y}_{i_1.} - \bar{Y}_{i_2.}) \pm .085$$

Given the crude nature of the Bonferroni method, 
these are surprisingly close to the intervals produced by Tukey's method, which have 
a half-width of .082. Here, too, we conclude that lab 1 produced significantly higher 
measurements than those of labs 4, 5, and 6. ■ 
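
The Bonferroni half-width and the resulting comparisons can be sketched in the same way. As before, the lab means, $s_p = .06$, $J = 10$, and the tabled value 3.16 are taken from the text, the lab labels are those obtained by averaging the columns of the tablet data, and the variable names are illustrative assumptions.

```python
from itertools import combinations
from math import comb, sqrt

# Lab means for the tablet data, keyed by lab number.
lab_means = {1: 4.062, 3: 4.003, 7: 3.998, 2: 3.997, 5: 3.957, 6: 3.955, 4: 3.920}

s_p = 0.06      # pooled standard deviation, as rounded in the text
J = 10          # observations per lab
t_crit = 3.16   # t_{60}(.025/21), used in the text to approximate t_{63}(.025/21)

k = comb(len(lab_means), 2)            # number of pairwise comparisons
half_width = t_crit * s_p * sqrt(2 / J)  # same as s_p * t / sqrt(5) when J = 10

significant = sorted(
    (a, b)
    for a, b in combinations(sorted(lab_means), 2)
    if abs(lab_means[a] - lab_means[b]) > half_width
)

print(k, round(half_width, 3))
print(significant)
```

Because the Bonferroni half-width is slightly larger than Tukey's .082, the difference of .083 between labs 3 and 4 no longer exceeds it, and only the comparisons of lab 1 with labs 4, 5, and 6 remain.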


A significant advantage of the Bonferroni method over Tukey's method is that it 
does not require equal sample sizes in each treatment. 


12.2.3 A Nonparametric Method: The Kruskal-Wallis Test 


The Kruskal-Wallis test is a generalization of the Mann-Whitney test that is conceptu- 
ally quite simple. The observations are assumed to be independent, but no particular 
distributional form, such as the normal, is assumed. The observations are pooled 
together and ranked. Let 

$R_{ij}$ = the rank of $Y_{ij}$ in the combined sample 

Let 

$$\bar{R}_{i.} = \frac{1}{J_i}\sum_{j=1}^{J_i} R_{ij}$$

be the average rank in the ith group. Let 

$$\bar{R}_{..} = \frac{1}{N}\sum_{i=1}^{I}\sum_{j=1}^{J_i} R_{ij} = \frac{N+1}{2}$$

where N is the total number of observations. As in the analysis of variance, let 

$$SS_B = \sum_{i=1}^{I} J_i(\bar{R}_{i.} - \bar{R}_{..})^2$$

be a measure of the dispersion of the $\bar{R}_{i.}$. $SS_B$ may be used to test the null hypoth- 
esis that the probability distributions generating the observations under the various 
treatments are identical. The larger $SS_B$ is, the stronger is the evidence against the 
null hypothesis. The exact null distribution of this statistic for various combinations 
of I and $J_i$ can be enumerated, as for the Mann-Whitney test. The null distribution 
is commonly available in computer packages. Tables are given in Lehmann (1975) 
and in references therein. For $I = 3$ and $J_i \ge 5$ or $I > 3$ and $J_i \ge 4$, a chi- 
square approximation to a normalized version of $SS_B$ is fairly accurate. Under the 
null hypothesis that the probability distributions of the I groups are identical, the 
statistic 

$$K = \frac{12}{N(N+1)}SS_B$$

is approximately distributed as a chi-square random variable with $I - 1$ degrees of 
freedom. The value of K can be found by running the ranks through an analysis of 
variance program and multiplying $SS_B$ by $12/[N(N+1)]$. It can be shown that K 
can also be expressed as 

$$K = \frac{12}{N(N+1)}\left(\sum_{i=1}^{I} J_i\bar{R}_{i.}^2\right) - 3(N+1)$$

which is easier to compute by hand. 
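
The two expressions for K can be compared numerically. The Python sketch below ranks the pooled observations, assigning tied values the average of the ranks they occupy (an assumption of this sketch; the text does not say how ties are handled, and no tie correction is applied), and then evaluates K both ways. The function names and the toy data are hypothetical.

```python
def midranks(values):
    """Ranks 1..N of the values; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # mean of ranks pos+1, ..., end+1
        for k in order[pos:end + 1]:
            ranks[k] = average_rank
        pos = end + 1
    return ranks


def kruskal_wallis(groups):
    """Return K from the definition and from the hand-computation formula."""
    pooled = [y for g in groups for y in g]
    N = len(pooled)
    ranks = midranks(pooled)
    # Split the pooled ranks back into their groups.
    rank_groups, start = [], 0
    for g in groups:
        rank_groups.append(ranks[start:start + len(g)])
        start += len(g)
    rank_means = [sum(r) / len(r) for r in rank_groups]
    grand = (N + 1) / 2
    ss_b = sum(len(r) * (m - grand) ** 2 for r, m in zip(rank_groups, rank_means))
    k_definition = 12 / (N * (N + 1)) * ss_b
    k_shortcut = (
        12 / (N * (N + 1))
        * sum(len(r) * m ** 2 for r, m in zip(rank_groups, rank_means))
        - 3 * (N + 1)
    )
    return k_definition, k_shortcut


# Toy data: three completely separated groups of three observations each.
k_def, k_short = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(k_def, k_short)
```

For the toy data the average ranks are 2, 5, and 8, so $SS_B = 54$ and $K = 12 \cdot 54/90 = 7.2$ by either formula. The same function can be applied to the tablet data; the value obtained may differ slightly from a packaged routine, depending on how that routine treats ties.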


EXAMPLE A 


For the data on the tablets, K = 29.51. Referring to Table 3 of Appendix B with 6 df, 
we see that the p-value is less than .005. The nonparametric analysis, too, indicates 
that there is a systematic difference among the labs. ■ 


Multiple comparison procedures for nonparametric methods are discussed in 
detail in Miller (1981). The Bonferroni method requires no special discussion; it can 
be applied to all comparisons tested by Mann-Whitney tests. 

Like the Mann-Whitney test, the Kruskal-Wallis test makes no assumption of 
normality and thus has a wider range of applicability than does the F test. It is 
especially useful in small-sample situations. Also, because data are replaced by their 
ranks, outliers will have less influence on this nonparametric test than on the F test. 
In some applications, the data consist of ranks (for example, in a wine tasting, judges 
usually rank the wines), which makes the use of the Kruskal-Wallis test natural. 


12.3 The Two-Way Layout 


A two-way layout is an experimental design involving two factors, each at two or 
more levels. The levels of one factor might be various drugs, for example, and the 
levels of the other factor might be genders. If there are I levels of one factor and 
J of the other, there are I × J combinations. We will assume that K independent 
observations are taken for each of these combinations. (The last section of this chapter 
will outline the advantages of such an experimental design.) 

The next section defines the parameters that we might want to estimate from a 
two-way layout. Later sections present statistical methods based on normal theory 
and nonparametric methods. 


12.3.1 Additive Parametrization 


To develop and illustrate the ideas in this section, we will use a portion of the data 
contained in a study of electric range energy consumption (Fechter and Porter 1978). 
The following table shows the mean number of kilowatt-hours used by three electric 
ranges in cooking on each of three menu days (means are over several cooks). 








Menu Day Range 1 Range 2 Range 3 
1 3.97 4.24 4.44 
2 2.39 2.61 2.82 
3 2.76 2.75 3.01 





We wish to describe the variation in the numbers in this table in terms of the 
effects of different ranges and different menu days. Denoting the number in the ith 
row and jth column by $Y_{ij}$, we first calculate a grand average 

$$\hat{\mu} = \bar{Y}_{..} = \frac{1}{9}\sum_{i=1}^{3}\sum_{j=1}^{3} Y_{ij} = 3.22$$

This gives a measure of typical energy consumption per menu day. 
The menu day means, averaged over the ranges, are 

$$\bar{Y}_{1.} = 4.22 \qquad \bar{Y}_{2.} = 2.61 \qquad \bar{Y}_{3.} = 2.84$$

We will define the differential effect of a menu day as the difference between the 
mean for that day and the overall mean; we will denote these differential effects by 
$\hat{\alpha}_i$, where i = 1, 2, or 3. 

$$\hat{\alpha}_1 = \bar{Y}_{1.} - \bar{Y}_{..} = 1.00$$
$$\hat{\alpha}_2 = \bar{Y}_{2.} - \bar{Y}_{..} = -.61$$
$$\hat{\alpha}_3 = \bar{Y}_{3.} - \bar{Y}_{..} = -.38$$

(Note that, except for rounding error, the $\hat{\alpha}_i$ would sum to zero.) In words, on menu 
day 1, 1 kwh more than the average is consumed, and so on. 
The range means, averaged over the menu days, are 

$$\bar{Y}_{.1} = 3.04 \qquad \bar{Y}_{.2} = 3.20 \qquad \bar{Y}_{.3} = 3.42$$

The differential effects of the ranges are 

$$\hat{\beta}_1 = \bar{Y}_{.1} - \bar{Y}_{..} = -.18$$
$$\hat{\beta}_2 = \bar{Y}_{.2} - \bar{Y}_{..} = -.02$$
$$\hat{\beta}_3 = \bar{Y}_{.3} - \bar{Y}_{..} = .20$$

The effects of the ranges are smaller than the effects of the menu days. 

The preceding description of the values in the table incorporates an overall aver- 
age level plus differential effects of ranges and menu days. This is a simple additive 
model. 

$$\hat{Y}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j$$

Here we use $\hat{Y}_{ij}$ to denote the fitted or predicted values of $Y_{ij}$ from the additive model. 

According to this additive model, the differences between the three ranges are the 
same on all menu days. For example, for i = 1, 2, 3, 

$$\hat{Y}_{i1} - \hat{Y}_{i2} = (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_1) - (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_2) = \hat{\beta}_1 - \hat{\beta}_2$$
Figure 12.3 shows that this is not quite the case. If the differences were exactly the same 
on all menu days, the three lines would be exactly parallel. The differences between 
the ranges appear nearly the same on menu days 1 and 2; the lines are nearly parallel. But on menu 
day 3, the difference between ranges 2 and 3 increased and the difference between 
ranges 1 and 2 decreased. This phenomenon is called an interaction between menu 
days and ranges: it is as if there were something about menu day 3 that especially 
affected adversely the energy consumption of range 1 relative to range 2. 

FIGURE 12.3 Plot of energy consumption versus menu day for three electric 
ranges. The dashed line corresponds to range 3, the dotted line to range 2, and the 
solid line to range 1. 


The differences of the observed values and the fitted values, $Y_{ij} - \hat{Y}_{ij}$, are the 
residuals from the additive model and are shown in the following table: 








Menu Day Range 1 Range 2 Range 3 
1 —.07 .04 .02 
2 —.04 .02 .01 
3 .10 —.07 —.03 





The residuals are small relative to the main effects, with the possible exception of 
those for menu day 3. 
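
The additive fit and its residuals can be reproduced from the table of mean energy consumption. The Python sketch below follows the definitions in this section (grand average, row and column effects, fitted values, residuals); the variable names are illustrative assumptions.

```python
# Mean kilowatt-hours: rows are menu days 1-3, columns are ranges 1-3.
table = [
    [3.97, 4.24, 4.44],
    [2.39, 2.61, 2.82],
    [2.76, 2.75, 3.01],
]

I, J = len(table), len(table[0])
grand = sum(sum(row) for row in table) / (I * J)
row_means = [sum(row) / J for row in table]
col_means = [sum(table[i][j] for i in range(I)) / I for j in range(J)]

alpha_hat = [m - grand for m in row_means]   # menu day effects
beta_hat = [m - grand for m in col_means]    # range effects

# Fitted values from the additive model and the residuals.
fitted = [[grand + alpha_hat[i] + beta_hat[j] for j in range(J)] for i in range(I)]
resid = [[table[i][j] - fitted[i][j] for j in range(J)] for i in range(I)]

print(round(grand, 2))
print([round(a, 2) for a in alpha_hat])
print([round(b, 2) for b in beta_hat])
for row in resid:
    print([round(r, 2) for r in row])
```

Because the sketch carries full precision rather than rounding at each step, the rows and columns of the unrounded residuals sum to zero up to floating-point error, while the rounded values match the table above.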

Interactions can be incorporated into the model to make it fit the data exactly. 
The residual in cell ij is 

$$\hat{\delta}_{ij} = Y_{ij} - \hat{Y}_{ij} = Y_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..}$$

Note that 

$$\sum_{i=1}^{3} \hat{\delta}_{ij} = \sum_{j=1}^{3} \hat{\delta}_{ij} = 0$$

For example, 

$$\sum_{i=1}^{3} \hat{\delta}_{ij} = \sum_{i=1}^{3} (Y_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..}) = 3\bar{Y}_{.j} - 3\bar{Y}_{..} - 3\bar{Y}_{.j} + 3\bar{Y}_{..} = 0$$

In the preceding table of residuals, the row and column sums are not exactly zero 
because of rounding errors. The model 

$$Y_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \hat{\delta}_{ij}$$

thus fits the data exactly; it is merely another way of expressing the numbers listed in 
the table. 

An additive model is simple and easy to interpret, especially in the absence of 
interactions. Transformations of the data are sometimes used to improve the adequacy 
of an additive model. The logarithmic transformation, for example, converts a mul- 
tiplicative model into an additive one. Transformations are also used to stabilize the 
variance (to make the variance independent of the mean) and to make normal theory 
more applicable. There is no guarantee, of course, that a given transformation will 
accomplish all these aims. 

The discussion in this section has centered on the parametrization and interpre- 
tation of the additive model as used in the analysis of variance. We have not taken 
into account the possibility of random errors and their effects on the inferences about 
the parameters, but will do so in the next section. 


12.3.2 Normal Theory for the Two-Way Layout 


In this section, we will assume that there are K > 1 observations per cell in a two-way 
layout. A design with an equal number of observations per cell is called balanced. 
Let $Y_{ijk}$ denote the kth observation in cell ij; the statistical model is


$$Y_{ijk} = \mu + \alpha_i + \beta_j + \delta_{ij} + \varepsilon_{ijk}$$


We will assume that the random errors, $\varepsilon_{ijk}$, are independent and normally distributed
with mean zero and common variance $\sigma^2$. Thus, $E(Y_{ijk}) = \mu + \alpha_i + \beta_j + \delta_{ij}$. The
parameters satisfy the following constraints:


$$\sum_{i=1}^{I} \alpha_i = 0, \qquad \sum_{j=1}^{J} \beta_j = 0, \qquad \sum_{i=1}^{I} \delta_{ij} = \sum_{j=1}^{J} \delta_{ij} = 0$$


We now find the mle's of the unknown parameters. Since the observations in cell
ij are normally distributed with mean $\mu + \alpha_i + \beta_j + \delta_{ij}$ and variance $\sigma^2$, and since
all the observations are independent, the log likelihood is


$$l = -\frac{IJK}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (Y_{ijk} - \mu - \alpha_i - \beta_j - \delta_{ij})^2$$





Maximizing the likelihood subject to the constraints given above yields the following 
estimates (see Problem 17 at the end of this chapter): 





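$$\hat{\mu} = \bar{Y}_{...}$$
$$\hat{\alpha}_i = \bar{Y}_{i..} - \bar{Y}_{...}, \qquad i = 1, \ldots, I$$
$$\hat{\beta}_j = \bar{Y}_{.j.} - \bar{Y}_{...}, \qquad j = 1, \ldots, J$$
$$\hat{\delta}_{ij} = \bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...}$$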


as is expected from the discussion in Section 12.3.1. 
Like one-way analysis of variance, two-way analysis of variance is conducted 
by comparing various sums of squares. The sums of squares are as follows: 


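$$SS_A = JK \sum_{i=1}^{I} (\bar{Y}_{i..} - \bar{Y}_{...})^2$$

$$SS_B = IK \sum_{j=1}^{J} (\bar{Y}_{.j.} - \bar{Y}_{...})^2$$

$$SS_{AB} = K \sum_{i=1}^{I} \sum_{j=1}^{J} (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})^2$$

$$SS_E = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (Y_{ijk} - \bar{Y}_{ij.})^2$$

$$SS_{TOT} = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (Y_{ijk} - \bar{Y}_{...})^2$$


The sums of squares satisfy this algebraic identity:

$$SS_{TOT} = SS_A + SS_B + SS_{AB} + SS_E$$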
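
Before turning to the proof, the identity can be checked numerically. The following Python sketch computes the sums of squares for a balanced two-way layout and confirms that the four components add up to the total. The function name `two_way_sums_of_squares` and the simulated data are assumptions made for illustration only; the error and total sums of squares are taken to be the sums of squared deviations of the observations from their cell means and from the grand mean, respectively.

```python
import random


def two_way_sums_of_squares(y):
    """Sums of squares for a balanced two-way layout.

    y[i][j] is the list of K observations in cell ij.
    """
    I, J, K = len(y), len(y[0]), len(y[0][0])
    cell = [[sum(y[i][j]) / K for j in range(J)] for i in range(I)]
    row = [sum(cell[i]) / J for i in range(I)]                       # Ybar_i..
    col = [sum(cell[i][j] for i in range(I)) / I for j in range(J)]  # Ybar_.j.
    grand = sum(row) / I                                             # Ybar_...
    ss_a = J * K * sum((r - grand) ** 2 for r in row)
    ss_b = I * K * sum((c - grand) ** 2 for c in col)
    ss_ab = K * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                    for i in range(I) for j in range(J))
    ss_e = sum((obs - cell[i][j]) ** 2
               for i in range(I) for j in range(J) for obs in y[i][j])
    ss_tot = sum((obs - grand) ** 2
                 for i in range(I) for j in range(J) for obs in y[i][j])
    return {"A": ss_a, "B": ss_b, "AB": ss_ab, "E": ss_e, "TOT": ss_tot}


# Simulated data (hypothetical): I = 2, J = 3, K = 4 observations per cell.
random.seed(0)
y = [[[random.gauss(10 + i + 2 * j, 1) for _ in range(4)]
      for j in range(3)]
     for i in range(2)]
ss = two_way_sums_of_squares(y)
print(ss["TOT"], ss["A"] + ss["B"] + ss["AB"] + ss["E"])
```

The two printed numbers agree up to floating-point error. The check relies on the design being balanced; with unequal cell sizes the row and column means would have to be computed differently.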


This identity may be proved by writing 





$$Y_{ijk} - \bar{Y}_{...} = (Y_{ijk} - \bar{Y}_{ij.}) + (\bar{Y}_{i..} - \bar{Y}_{...}) + (\bar{Y}_{.j.} - \bar{Y}_{...}) + (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})$$


and then squaring both sides, summing, and verifying that the cross products vanish.
The following theorem gives the expectations of these sums of squares.


THEOREM A

Under the assumption that the errors are independent with mean zero and variance
$\sigma^2$,

$$E(SS_A) = (I - 1)\sigma^2 + JK \sum_{i=1}^{I} \alpha_i^2$$

$$E(SS_B) = (J - 1)\sigma^2 + IK \sum_{j=1}^{J} \beta_j^2$$

$$E(SS_{AB}) = (I - 1)(J - 1)\sigma^2 + K \sum_{i=1}^{I} \sum_{j=1}^{J} \delta_{ij}^2$$

$$E(SS_E) = IJ(K - 1)\sigma^2$$

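Proof


The results for $SS_A$, $SS_B$, and $SS_E$ follow from Lemma A of Section 12.2.1.
Applying the lemma to $SS_{TOT}$, we have


$$E(SS_{TOT}) = E \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (Y_{ijk} - \bar{Y}_{...})^2$$

$$= (IJK - 1)\sigma^2 + \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (\alpha_i + \beta_j + \delta_{ij})^2$$

$$= (IJK - 1)\sigma^2 + JK \sum_{i=1}^{I} \alpha_i^2 + IK \sum_{j=1}^{J} \beta_j^2 + K \sum_{i=1}^{I} \sum_{j=1}^{J} \delta_{ij}^2$$


In the last step, we used the constraints on the parameters. For example, the cross
product involving $\alpha_i$ and $\beta_j$ is


$$\sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \alpha_i \beta_j = K \left( \sum_{i=1}^{I} \alpha_i \right) \left( \sum_{j=1}^{J} \beta_j \right) = 0$$


The desired expression for $E(SS_{AB})$ now follows, since


$$E(SS_{TOT}) = E(SS_A) + E(SS_B) + E(SS_{AB}) + E(SS_E)$$ ■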


The distributions of these sums of squares are given by the following theorem.


THEOREM B


Assume that the errors are independent and normally distributed with means zero
and variances $\sigma^2$. Then


a. $SS_E/\sigma^2$ follows a chi-square distribution with $IJ(K - 1)$ degrees of freedom.
b. Under the null hypothesis


$$H_A: \alpha_i = 0, \qquad i = 1, \ldots, I$$


$SS_A/\sigma^2$ follows a chi-square distribution with $I - 1$ degrees of freedom.
c. Under the null hypothesis


$$H_B: \beta_j = 0, \qquad j = 1, \ldots, J$$


$SS_B/\sigma^2$ follows a chi-square distribution with $J - 1$ degrees of freedom.
d. Under the null hypothesis


$$H_{AB}: \delta_{ij} = 0, \qquad i = 1, \ldots, I, \quad j = 1, \ldots, J$$


$SS_{AB}/\sigma^2$ follows a chi-square distribution with $(I - 1)(J - 1)$ degrees of
freedom.
e. The sums of squares are independently distributed.


Proof


We will not give a full proof of this theorem. The results for $SS_A$, $SS_B$, and $SS_E$
follow from arguments similar to those used in proving Theorem B of Section
12.2.1. The result for $SS_{AB}$ requires some additional argument. ■


F tests of the various null hypotheses are conducted by comparing the appropriate
sums of squares to the sum of squares for error, as was done for the simpler case of the
one-way layout. The mean squares are the sums of squares divided by their degrees of
freedom and the F statistics are ratios of mean squares. When such a ratio is
substantially larger than 1, the presence of an effect is suggested. Note, for example, that from
Theorem A, $E(MS_A) = \sigma^2 + (JK/(I - 1)) \sum_i \alpha_i^2$ and that $E(MS_E) = \sigma^2$. So if the
ratio $MS_A/MS_E$ is large, it suggests that some of the $\alpha_i$ are nonzero. The null
distribution of this F statistic is the F distribution with $(I - 1)$ and $IJ(K - 1)$ degrees of
freedom, and knowing this null distribution allows us to assess the significance of the
ratio.


EXAMPLE A

As an example, we return to the experiment on iron retention discussed in Section
11.2.1.1. In the complete experiment, there were $I = 2$ forms of iron, $J = 3$ dosage
levels, and $K = 18$ observations per cell. In Section 11.2.1.1, we discussed a
logarithmic transformation of the data to make it more nearly normal and to stabilize the
variance. Figure 12.4 shows boxplots of the data on the original scale; boxplots of the
log data are given in Figure 12.5. The distribution of the log data is more symmetrical,
and the interquartile ranges are less variable. Figure 12.6 is a plot of cell standard
deviations versus cell means for the untransformed data; it shows that the error
variance increases with the mean. Figure 12.7 is a plot of cell standard deviations versus
means for the log data; it shows that the transformation is successful in stabilizing
the variance. Note that one of the assumptions of Theorem B is that the errors have
equal variance.


FIGURE 12.4 Boxplots of iron retention for two forms of iron at three dosage levels.

FIGURE 12.5 Boxplots of log data on iron retention.

FIGURE 12.6 Plot of cell standard deviations versus cell means for iron retention
data.

FIGURE 12.7 Plot of cell standard deviations versus cell means for log data on iron
retention.

FIGURE 12.8 Plot of cell means of log data versus dosage level. The dashed line
corresponds to Fe$^{2+}$ and the solid line to Fe$^{3+}$.


Figure 12.8 is a plot of the cell means of the transformed data versus the dosage 
levels for the two forms of iron. It suggests that Fe$^{2+}$ may be retained more than Fe$^{3+}$. If
there is no interaction, the two curves should be parallel except for random variation. 
This appears to be roughly the case, although there is a hint that the difference in 
retention of the two forms of iron increases with dosage level. To check this, we will 
perform a quantitative test for interaction. 

In the following analysis of variance table, $SS_A$ is the sum of squares due to
the form of iron, $SS_B$ is the sum of squares due to dosage, and $SS_{AB}$ is the sum of
squares due to interaction. The F statistics were found by dividing the appropriate
mean square by the mean square for error. 


Analysis of Variance Table 








Source          df        SS        MS        F
Iron form        1     2.074     2.074     5.99
Dosage           2    15.588     7.794    22.53
Interaction      2      .810      .405     1.17
Error          102    35.296      .346

Total          107    53.768
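
The arithmetic behind the table can be reproduced with a few lines of Python. This is only a sketch: the dictionary `anova` and the helper `mean_squares_and_f` are names introduced here for illustration, and the degrees of freedom and sums of squares are copied from the table above.

```python
# (degrees of freedom, sum of squares) for each source, as printed in the table.
anova = {
    "Iron form": (1, 2.074),
    "Dosage": (2, 15.588),
    "Interaction": (2, 0.810),
    "Error": (102, 35.296),
}


def mean_squares_and_f(table, error_key="Error"):
    """Mean square = SS / df; F = mean square / mean square for error."""
    ms = {source: ss / df for source, (df, ss) in table.items()}
    f = {source: ms[source] / ms[error_key]
         for source in table if source != error_key}
    return ms, f


ms, f_stats = mean_squares_and_f(anova)
total_ss = sum(ss for _, ss in anova.values())
total_df = sum(df for df, _ in anova.values())
for source, value in f_stats.items():
    print(source, round(ms[source], 3), round(value, 2))
```

Because the table reports the mean square for error rounded to three decimals, F statistics computed from the unrounded mean square may differ from the printed values in the last digit.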





To test the effect of the form of iron, we test


$$H_A: \alpha_1 = \alpha_2 = 0$$

using the statistic

$$F = \frac{SS_A/1}{SS_E/102} = 5.99$$

From computer evaluation of the F distribution with 1 and 102 df, the p-value is less
than .025. There is an effect due to the form of iron. An estimate of the difference
$\alpha_1 - \alpha_2$ is


$$\bar{Y}_{1..} - \bar{Y}_{2..} = .28$$


and a confidence interval for the difference may be obtained by noting that $\bar{Y}_{1..}$ and
$\bar{Y}_{2..}$ are uncorrelated since they are averages over different observations and that


$$\mathrm{Var}(\bar{Y}_{1..}) = \mathrm{Var}(\bar{Y}_{2..}) = \frac{\sigma^2}{JK}$$

Thus,

$$\mathrm{Var}(\bar{Y}_{1..} - \bar{Y}_{2..}) = \frac{2\sigma^2}{JK}$$

Estimating $\sigma^2$ by the mean square for error, $\mathrm{Var}(\bar{Y}_{1..} - \bar{Y}_{2..})$ is estimated by

$$s^2_{\bar{Y}_{1..} - \bar{Y}_{2..}} = \frac{2 \times .346}{54} = .0128$$


A confidence interval can be constructed using the t distribution with $IJ(K - 1)$
degrees of freedom. The interval is of the form


$$(\bar{Y}_{1..} - \bar{Y}_{2..}) \pm t_{IJ(K-1)}(\alpha/2)\, s_{\bar{Y}_{1..} - \bar{Y}_{2..}}$$


There are 102 df; to form a 95% confidence interval we use $t_{120}(.025) = 1.98$
from Table 4 of Appendix B as an approximation, producing the interval $.28 \pm
1.98\sqrt{.0128}$, or (.06, .50).

Recall that we are working on a log scale. The additive effect of .28 on the log
scale corresponds to a multiplicative effect of $e^{.28} = 1.32$ on a linear scale and the
interval (.06, .50) transforms to $(e^{.06}, e^{.50})$, or (1.06, 1.65). Thus, we estimate that
Fe$^{2+}$ increases retention by a factor of 1.32, and the uncertainty in this factor is
expressed in the confidence interval (1.06, 1.65).
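
The interval calculation and the back-transformation can be sketched in Python as follows. The variable names are introduced here for illustration; the inputs (.346, $J = 3$, $K = 18$, .28, and the tabled value 1.98) are the ones quoted in the text.

```python
import math

ms_error = 0.346        # mean square for error from the analysis of variance table
J, K = 3, 18            # dosage levels and observations per cell
diff = 0.28             # estimated difference between the two forms on the log scale
t_crit = 1.98           # t_120(.025), used in the text as an approximation for 102 df

# Estimated variance of the difference of the two means, each based on J*K observations.
var_diff = 2 * ms_error / (J * K)
half_width = t_crit * math.sqrt(var_diff)
lower, upper = diff - half_width, diff + half_width

# Back-transform from the log scale to a multiplicative factor.
factor = math.exp(diff)
factor_lower, factor_upper = math.exp(lower), math.exp(upper)

print(round(var_diff, 4), round(lower, 2), round(upper, 2))
print(round(factor, 2), round(factor_lower, 2), round(factor_upper, 2))
```

The sketch exponentiates the unrounded endpoints, so the back-transformed interval may differ in the last digit from the one in the text, which exponentiates the rounded endpoints .06 and .50.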

The F statistic for testing the effect of dosage is significant, but this effect is 
expected and is not of major interest. 

To test the hypothesis $H_{AB}$, which states that there is no interaction, we consider
the following F statistic:


$$F = \frac{SS_{AB}/[(I - 1)(J - 1)]}{SS_E/[IJ(K - 1)]} = 1.17$$


From computer evaluation of the F distribution with 2 and 102 df, the p-value is .31,
so there is insufficient evidence to reject this hypothesis. Thus, the deviation of the
lines of Figure 12.8 from parallelism could easily be due to chance.

In conclusion, it appears that there is a difference of 6% to 65% in the ratio of
percentage retained between the two forms of iron and that there is little evidence
that this difference depends on dosage. ■


12.3.3


Randomized Block Designs 


Randomized block designs originated in agricultural experiments. To compare the
effects of $I$ different fertilizers, $J$ relatively homogeneous plots of land, or blocks,
are selected, and each is divided into $I$ plots. Within each block, the assignment of
fertilizers to plots is made at random. By comparing fertilizers within blocks, the 
variability between blocks, which would otherwise contribute “noise” to the results, 
is controlled. This design is a multisample generalization of a matched-pairs design. 

A randomized block design might be used by a nutritionist who wants to compare 
the effects of three different diets on experimental animals. To control for genetic 
variation in the animals, the nutritionist might select three animals from each of 
several litters and randomly determine their assignments to the diets. Randomized 
block designs are used in many areas. If an experiment is to be carried out over a 
substantial period of time, the blocks may consist of stretches of time. In industrial 
experiments, the blocks are often batches of raw material. 
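
The random assignment within blocks can be sketched in a few lines of Python. The function `randomize_within_blocks` and the diet and litter labels are hypothetical names used only for illustration; the sketch assumes that each block contains exactly as many experimental units as there are treatments.

```python
import random


def randomize_within_blocks(treatments, blocks, seed=None):
    """Return, for each block, a random ordering of the treatments.

    Position k of the ordering is the treatment given to unit k of that block,
    so every treatment appears exactly once in every block.
    """
    rng = random.Random(seed)
    plan = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)      # independent random permutation for each block
        plan[block] = order
    return plan


plan = randomize_within_blocks(
    ["diet 1", "diet 2", "diet 3"],
    ["litter A", "litter B", "litter C", "litter D"],
    seed=1,
)
for block, order in plan.items():
    print(block, order)
```

Fixing the seed makes the plan reproducible; in an actual experiment the seed would be chosen without reference to the experimental units.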

Randomization helps ensure against unintentional bias and can form a basis 
for inference. In principle, the null distribution of a test statistic can be derived by 
permutation arguments, just as we derived the null distribution of the Mann-Whitney 
test statistic in Section 11.2.3. Parametric procedures often give a good approximation 
to the permutation distribution. 

As a model for the responses in the randomized block design, we will use 


$$Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$$


where $\alpha_i$ is the differential effect of the ith treatment, $\beta_j$ is the differential effect
of the jth block, and the $\varepsilon_{ij}$ are independent random errors. This is the model of
Section 12.3.2 but with the additional assumption of no interactions between blocks
and treatments. Interest is focused on the $\alpha_i$.

From Theorem A of Section 12.3.2, if there is no interaction, 


$$E(MS_A) = \sigma^2 + \frac{J}{I - 1} \sum_{i=1}^{I} \alpha_i^2$$

$$E(MS_B) = \sigma^2 + \frac{I}{J - 1} \sum_{j=1}^{J} \beta_j^2$$

$$E(MS_{AB}) = \sigma^2$$


Thus, $\sigma^2$ can be estimated from $MS_{AB}$. Also, since these mean squares are
independently distributed, F tests can be performed to test $H_A$ or $H_B$. For example, to
test


$$H_A: \alpha_i = 0, \qquad i = 1, \ldots, I$$

this statistic can be used:

$$F = \frac{MS_A}{MS_{AB}}$$


From Theorem B in Section 12.3.2, under $H_A$, the statistic follows an F distribution
with $I - 1$ and $(I - 1)(J - 1)$ degrees of freedom. $H_B$ may be tested similarly but is
not usually of interest. Note that if, contrary to the assumption, there is an interaction,
then


$$E(MS_{AB}) = \sigma^2 + \frac{1}{(I - 1)(J - 1)} \sum_{i=1}^{I} \sum_{j=1}^{J} \delta_{ij}^2$$


$MS_{AB}$ will tend to overestimate $\sigma^2$. This will cause the F statistic to be smaller than
it should be and will result in a test that is conservative; that is, the actual probability
of type I error will be smaller than desired.


Let us consider an experimental study of drugs to relieve itching (Beecher 1959). Five 
drugs were compared to a placebo and no drug with 10 volunteer male subjects aged 
20-30. (Note that this set of subjects limits the scope of inference; from a statistical 
point of view, one cannot extrapolate the results of the experiment to older women, 
for example. Any such extrapolation could be justified only on grounds of medical 
judgment.) Each volunteer underwent one treatment per day, and the time-order was 
randomized. Thus, individuals were "blocks." The subjects were given a drug (or 
placebo) intravenously, and then itching was induced on their forearms with cowage, 
an effective itch stimulus. The subjects recorded the duration of the itching. More 
details are in Beecher (1959). The following table gives the durations of the itching 
(in seconds): 





Subject   No Drug   Placebo   Papaverine   Morphine   Aminophylline   Pentobarbital   Tripelennamine
BG          174       263        105         199           141             108              141
JF          224       213        103         143           168             341              184
BS          260       231        145         113            78             159              125
SI          255       291        103         225           164             135              227
BW          165       168        144         176           127             239              194
TS          237       121         94         144           114             136              155
GM          191       137         35          87            96             140              121
SS          100       102        133         120           222             134              129
MU          115        89         83         100           165             185               79
OS          189       433        237         173           168             188              317
Average    191.0     204.8      118.2       148.0         144.3           176.5            167.2


Figure 12.9 shows boxplots of the responses to the six treatments and to the 
control (no drugs). Although the boxplot is probably not the ideal visual display of 
these data, since it takes no account of the blocking, Figure 12.9 does show some 
interesting aspects of the data. There is a suggestion that all the drugs had some effect 
and that papaverine was the most effective. There is a lot of scatter relative to the 
differences between the medians, and there are some outliers. It is interesting that 
the placebo responses have the greatest spread; this might be because some subjects
responded to the placebo and some did not.

FIGURE 12.9 Boxplots of durations of itching under seven treatments.


We next construct an analysis of variance table for this experiment: 








Source          df        SS        MS       F
Drugs            6     53013      8835    2.85
Subjects         9    103280     11476    3.71
Interaction     54    167130      3095

Total           69    323422
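
The entries of this table can be reproduced from the durations given earlier. The following Python sketch is one way to organize the computation; the function name `randomized_block_anova` and the dictionary keys are assumptions introduced here, and the data are copied from the table of itching durations (rows are subjects, columns are treatments in the order listed there).

```python
# Durations of itching in seconds; rows are subjects (blocks), columns are the
# seven treatments: no drug, placebo, papaverine, morphine, aminophylline,
# pentobarbital, tripelennamine.
durations = [
    [174, 263, 105, 199, 141, 108, 141],  # BG
    [224, 213, 103, 143, 168, 341, 184],  # JF
    [260, 231, 145, 113, 78, 159, 125],   # BS
    [255, 291, 103, 225, 164, 135, 227],  # SI
    [165, 168, 144, 176, 127, 239, 194],  # BW
    [237, 121, 94, 144, 114, 136, 155],   # TS
    [191, 137, 35, 87, 96, 140, 121],     # GM
    [100, 102, 133, 120, 222, 134, 129],  # SS
    [115, 89, 83, 100, 165, 185, 79],     # MU
    [189, 433, 237, 173, 168, 188, 317],  # OS
]


def randomized_block_anova(y):
    """Sums of squares and F statistics for one observation per cell."""
    n_blocks, n_trt = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (n_blocks * n_trt)
    trt_means = [sum(row[i] for row in y) / n_blocks for i in range(n_trt)]
    blk_means = [sum(row) / n_trt for row in y]
    ss_trt = n_blocks * sum((m - grand) ** 2 for m in trt_means)
    ss_blk = n_trt * sum((m - grand) ** 2 for m in blk_means)
    ss_tot = sum((v - grand) ** 2 for row in y for v in row)
    ss_int = ss_tot - ss_trt - ss_blk          # interaction sum of squares
    ms_int = ss_int / ((n_trt - 1) * (n_blocks - 1))
    return {
        "ss_trt": ss_trt, "ss_blk": ss_blk, "ss_int": ss_int, "ss_tot": ss_tot,
        "ms_int": ms_int,
        "f_trt": (ss_trt / (n_trt - 1)) / ms_int,
        "f_blk": (ss_blk / (n_blocks - 1)) / ms_int,
    }


result = randomized_block_anova(durations)
for key, value in result.items():
    print(key, round(value, 2))
```

With one observation per cell there is no separate error sum of squares, so the interaction mean square serves as the denominator of both F statistics, as in the table.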





The F statistic for testing differences between drugs is 2.85 with 6 and 54 df,
corresponding to a p-value less than .025. The null hypothesis that there is no difference
between subjects is not experimentally interesting.

Figure 12.10 is a probability plot of the residuals from the two-way analysis of 
variance model. The residual in cell ij is 


$$r_{ij} = Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j = Y_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..}$$


There is a slightly bowed character to the probability plot, indicating some skewness 
in the distribution of the residuals. But because the F test is robust against moderate 
deviations from normality, we should not be overly concerned. 

Tukey's method may be applied to make multiple comparisons. Suppose that
we want to compare the drug means, $\bar{Y}_{1.}, \ldots, \bar{Y}_{7.}$ ($I = 7$). These have expectations
$\mu + \alpha_i$, where $i = 1, \ldots, 7$, and each is an average over $J = 10$ independent
observations. The error variance is estimated by $MS_{AB}$ with 54 df. Simultaneous 95%
confidence intervals for all differences between drug means have half-widths of


$$q_{7,54}(.05) \sqrt{\frac{MS_{AB}}{J}} = q_{7,54}(.05) \sqrt{\frac{3095}{10}} = 75.8$$


[Here we have used $q_{7,60}(.05)$ from Table 6 of Appendix B as an approximation to
$q_{7,54}(.05)$.] Examining the table of means, we see that, at the 95% confidence level,
we can conclude only that papaverine achieves a reduction of itching over the effect
of a placebo. ■


FIGURE 12.10 Normal probability plot of residuals from two-way analysis of
variance of data on duration of itching. (Horizontal axis: normal quantiles.)


12.3.4


A Nonparametric Method—Friedman's Test 


This section presents a nonparametric method for the randomized block design. Like 
other nonparametric methods we have discussed, Friedman's test relies on ranks and 
does not make an assumption of normality. The test is very simple. Within each of 
the $J$ blocks, the observations are ranked. To test the hypothesis that there is no effect
due to the factor corresponding to treatments ($I$), the following statistic is calculated:


$$SS_A = J \sum_{i=1}^{I} (\bar{R}_{i.} - \bar{R}_{..})^2$$

just as in the ordinary analysis of variance. Under the null hypothesis that there
is no treatment effect and that the only effect is due to the randomization within
blocks, the permutation distribution of the statistic can, in principle, be calculated.
For sample sizes such as that of the itching experiment, a chi-square approximation
to this distribution is perfectly adequate. The null distribution of


$$Q = \frac{12J}{I(I + 1)} \sum_{i=1}^{I} (\bar{R}_{i.} - \bar{R}_{..})^2$$


is approximately chi-square with $I - 1$ degrees of freedom.


To carry out Friedman's test on the data from the experiment on itching, we first 
construct the following table by ranking durations of itching for each subject: 





Subject   No Drug   Placebo   Papaverine   Morphine   Aminophylline   Pentobarbital   Tripelennamine
BG           5         7          1           6           3.5              2               3.5
JF           6         5          1           2           3                7               4
BS           7         6          4           2           1                5               3
SI           6         7          1           4           3                2               5
BW           3         4          2           5           1                7               6
TS           7         3          1           5           2                4               6
GM           7         5          1           2           3                6               4
SS           1         2          5           3           7                6               4
MU           5         3          2           4           6                7               1
OS           4         7          5           2           1                3               6
Average     5.10      4.90       2.30        3.50        3.05             4.90            4.25


Note that we have handled ties in the usual way by assigning average ranks. From the
preceding table, no drug, placebo, and pentobarbital have the highest average ranks.
From these average ranks, we find $\bar{R}_{..} = 4$, $\sum (\bar{R}_{i.} - \bar{R}_{..})^2 = 6.935$, and $Q = 14.86$.
From Table 3 of Appendix B with 6 df, the p-value is less than .025. The nonparametric
analysis also rejects the hypothesis that there is no drug effect. ■
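
The ranking and the statistic Q can be reproduced with the following Python sketch. The helper names `average_ranks` and `friedman_q` are introduced here for illustration; the data are the itching durations from Section 12.3.3, with rows as subjects and columns as treatments.

```python
durations = [
    [174, 263, 105, 199, 141, 108, 141],  # BG
    [224, 213, 103, 143, 168, 341, 184],  # JF
    [260, 231, 145, 113, 78, 159, 125],   # BS
    [255, 291, 103, 225, 164, 135, 227],  # SI
    [165, 168, 144, 176, 127, 239, 194],  # BW
    [237, 121, 94, 144, 114, 136, 155],   # TS
    [191, 137, 35, 87, 96, 140, 121],     # GM
    [100, 102, 133, 120, 222, 134, 129],  # SS
    [115, 89, 83, 100, 165, 185, 79],     # MU
    [189, 433, 237, 173, 168, 188, 317],  # OS
]


def average_ranks(values):
    """Ranks 1..n, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1      # average of ranks pos+1, ..., end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def friedman_q(y):
    """Rank within each block (row) and compute Friedman's statistic Q."""
    n_blocks, n_trt = len(y), len(y[0])
    ranks = [average_ranks(row) for row in y]
    mean_ranks = [sum(r[i] for r in ranks) / n_blocks for i in range(n_trt)]
    overall = (n_trt + 1) / 2          # mean of the ranks 1, ..., I
    q = 12 * n_blocks / (n_trt * (n_trt + 1)) * sum(
        (m - overall) ** 2 for m in mean_ranks)
    return mean_ranks, q


mean_ranks, q = friedman_q(durations)
print([round(m, 2) for m in mean_ranks], round(q, 2))
```

The sketch computes only the statistic; the p-value would still be read from a chi-square table with $I - 1 = 6$ degrees of freedom, as in the text.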





Procedures for using Friedman’s test for multiple comparisons are discussed 
by Miller (1981). When these methods are applied to the data from the experiment 
on itching, the conclusions reached are identical to those reached by the parametric 
analysis. 


12.4


Concluding Remarks


The most complicated experimental design considered in this chapter was the two-way
layout; more generally, a factorial design incorporates several factors with one
or more observations per cell. With such a design, the concept of interaction
becomes more complicated: there are interactions of various orders. For instance, in a
three-factor experiment, there are two-factor and three-factor interactions. It is both
interesting and useful that the two-factor interactions in a three-way layout can be
estimated using only one observation per cell.

To gain some insight into why factorial designs are effective, we can begin by 
considering a two-way layout, with each factor at five levels, no interaction, and one 
observation per cell. With this design, comparisons of two levels of any factor are 
based on 10 observations. A traditional alternative to this design is to do first an 
experiment comparing the levels of factor A and then another experiment comparing 
the levels of factor B. To obtain the same precision as is achieved by the two-way 
layout in this case, 25 observations in each experiment, or a total of 50 observations, 
would be needed. The factorial design achieves its economy by using the same
observations to compare the levels of factor A as are used to compare the levels of
factor B.

The advantages of factorial designs become greater as the number of factors 
increases. For example, in an experiment with four factors, with each factor at two 
levels (which might be the presence or absence of some chemical, for example) and one 
observation per cell, there are 16 observations that may be used to compare the levels 
of each factor. Furthermore, it can be shown that two- and three-factor interactions 
can be estimated. By comparison, if each of the four factors were investigated in a 
separate experiment, 64 observations would be required to attain the same precision. 
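
The counts quoted in the last two paragraphs follow from simple arithmetic, sketched below in Python. The function `observations_needed` is a name introduced here; it assumes one observation per cell, no interaction, and that "the same precision" means each level of a factor is observed as many times in a separate one-factor experiment as it is in the factorial design.

```python
from math import prod


def observations_needed(levels):
    """Compare a factorial design with separate one-factor experiments.

    levels[f] is the number of levels of factor f.
    Returns (observations in the factorial design,
             observations per level of each factor in that design,
             total observations for separate one-factor experiments).
    """
    factorial = prod(levels)
    # In the factorial design, each level of factor f appears in factorial / levels[f] cells.
    per_level = [factorial // n for n in levels]
    # A separate experiment on factor f needs that many observations at each of its levels.
    one_at_a_time = sum(n * reps for n, reps in zip(levels, per_level))
    return factorial, per_level, one_at_a_time


print(observations_needed([5, 5]))        # two factors, five levels each
print(observations_needed([2, 2, 2, 2]))  # four factors, two levels each
```

For two factors at five levels the sketch gives 25 observations against 50, and for four factors at two levels it gives 16 against 64, matching the figures in the text.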

As the number of factors increases, the number of observations necessary for 
a factorial experiment with only one observation per cell grows very rapidly. To 
decrease the cost of an experiment, certain cells, designated in a systematic way, can 
be left empty, and the main effects and some interactions can still be estimated. Such 
arrangements are called fractional factorial designs. 

Similarly, with a randomized block design, the individual blocks may not be 
able to accommodate all the treatments. For example, in a chemical experiment that 
compares a large number of treatments, the blocks of the experiment, batches of raw 
material of uniform quality, may not be large enough. In such situations, incomplete 
block designs may be used to retain the advantages of blocking. 

The basic theoretical assumptions underlying the analysis of variance are that the 
errors are independent and normally distributed with constant variance. Because we 
cannot fully check the validity of these assumptions in practice and can probably detect 
only gross violations, it is natural to ask how robust the procedures are with respect 
to violations of the assumptions. It is impossible to give a complete and conclusive 
answer to this question. Generally speaking, the independence assumption is probably 
the most important (and this is true for nonparametric procedures as well). The F test 
is robust against moderate departures from normality; if the design is balanced, the 
F test is also robust against unequal error variance. 

For further reading, Box, Hunter, and Hunter (1978) is recommended. 


12.5 Problems 


1. Simulate observations like those of Figure 12.1 under the null hypothesis of no
treatment effects. That is, simulate seven batches of ten normally distributed
random numbers with mean 4 and variance .0037. Make parallel boxplots of
these seven batches like those of Figure 12.1. Do this several times. Your figures
display the kind of variability that random fluctuations can cause; do you see any
pairs of labs that appear quite different in either mean level or dispersion?


2. Verify that if $I = 2$, the estimate $s_p^2$ of Theorem A of Section 11.2.1 is the $s_p^2$
given in Section 12.2.1.


3. For a one-way analysis of variance with $I = 2$ treatment groups, show that the
F statistic is $t^2$, where t is the usual t statistic for a two-sample case.


4. Prove the analogues of Theorems A and B in Section 12.2.1 for the case of
unequal numbers of observations in the cells of a one-way layout.


5. Derive the likelihood ratio test for the null hypothesis of the one-way layout, and
show that it is equivalent to the F test.


6. Prove this version of the Bonferroni inequality:

$$P\left( \bigcap_{i=1}^{n} A_i \right) \ge 1 - \sum_{i=1}^{n} P(A_i^c)$$

(Use Venn diagrams if you wish.) In the context of simultaneous confidence
intervals, what is $A_i$ and what is $A_i^c$?


7. Show that, as claimed in Theorem B of Section 12.2.1, $SS_B/\sigma^2 \sim \chi^2_{I-1}$.


8. Form simultaneous confidence intervals for the difference of the mean of lab 1
and those of labs 4, 5, and 6 in Example A of Section 12.2.2.1.


9. Compare the tables of the t distribution and the studentized range in Appendix
B. For example, consider the column corresponding to $t_{.95}$; multiply the numbers
in that column by $\sqrt{2}$ and observe that you get the numbers in the column $t = 2$
of the table of $q_{.90}$. Why is this?


10. Suppose that in a one-way layout there are 10 treatments and seven observations
under each treatment. What is the ratio of the length of a simultaneous confidence
interval for the difference of two means formed by Tukey's method to that of one
formed by the Bonferroni method? How do both of these compare in length to
an interval based on the t distribution that does not take account of multiple
comparisons?


11. Consider a hypothetical two-way layout with four factors (A, B, C, D) each at
three levels (I, II, III). Construct a table of cell means for which there is no
interaction.


12. Consider a hypothetical two-way layout with three factors (A, B, C) each at two
levels (I, II). Is it possible for there to be interactions but no main effects?


13. Show that for comparing two groups the Kruskal-Wallis test is equivalent to the
Mann-Whitney test.


14. Show that for comparing two groups Friedman's test is equivalent to the sign
test.


15. Show the equality of the two forms of K given in Section 12.2.3:


$$K = \frac{12}{N(N + 1)} \sum_{i=1}^{I} J_i (\bar{R}_{i.} - \bar{R}_{..})^2 = \frac{12}{N(N + 1)} \left( \sum_{i=1}^{I} J_i \bar{R}_{i.}^2 \right) - 3(N + 1)$$


16. Prove the sums of squares identity for the two-way layout:

$$SS_{TOT} = SS_A + SS_B + SS_{AB} + SS_E$$


17. Find the mle's of the parameters $\alpha_i$, $\beta_j$, $\delta_{ij}$, and $\mu$ of the model for the two-way
layout.


18. The table below gives the energy use of five gas ranges for seven menu days. (The
units are equivalent kilowatt-hours; .239 kwh = 1 ft$^3$ of natural gas.) Estimate
main effects and discuss interaction, paralleling the discussion of Section 12.3.


Menu Day   Range 1   Range 2   Range 3   Range 4   Range 5
    1        8.25      8.26      6.55      8.21      6.69
    2        5.12      4.81      3.87      4.81      3.99
    3        3.32      4.37      3.76      4.67      4.37
    4        8.00      6.50      5.38      6.51      5.60
    5        6.97      6.26      5.03      6.40      5.60
    6        7.65      5.84      5.23      6.24      5.73
    7        7.86      7.31      5.87      6.64      6.03


19. Develop a parametrization for a balanced three-way layout. Define main effects
and two-factor and three-factor interactions, and discuss their interpretation. What
linear constraints do the parameters satisfy?


20. This problem introduces a random effects model for the one-way layout.
Consider a balanced one-way layout in which the $I$ groups being compared are
regarded as being a sample from some larger population. The random effects
model is


$$Y_{ij} = \mu + A_i + \varepsilon_{ij}$$


where the $A_i$ are random and independent of each other with $E(A_i) = 0$ and
$\mathrm{Var}(A_i) = \sigma_A^2$. The $\varepsilon_{ij}$ are independent of the $A_i$ and of each other, and $E(\varepsilon_{ij}) = 0$
and $\mathrm{Var}(\varepsilon_{ij}) = \sigma_\varepsilon^2$.

To fix these ideas, we can consider an example from Davies (1960). The
variation of the strength (coloring power) of a dyestuff from one manufactured
batch to another was studied. Strength was measured by dyeing a square of cloth
with a standard concentration of dyestuff under carefully controlled conditions
and visually comparing the result with a standard. The result was numerically
scored by a technician. Large samples were taken from six batches of a dyestuff;
each sample was well mixed, and from each six subsamples were taken. These
36 subsamples were submitted to the laboratory in random order over a period of
several weeks for testing as described. The percentage strengths of the dyestuff
are given in the following table.


Batch   Subsample 1   Subsample 2   Subsample 3   Subsample 4   Subsample 5   Subsample 6
I           94.5          93.0          91.0          89.0          96.5          88.0
II          89.0          90.0          92.5          88.5          91.5          91.5
III         88.5          93.5          93.5          88.0          92.5          91.5
IV         100.0          99.0         100.0          98.0          95.0          97.5
V           91.5          93.0          90.0          92.5          89.0          91.0
VI          98.5         100.0          98.0         100.0          96.5          98.0


There are two sources of variability in these numbers: batch-to-batch variability
and measurement variability. It is hoped that variability between subsamples
has been eliminated by the mixing. We will consider the random effects model,


$$Y_{ij} = \mu + A_i + \varepsilon_{ij}$$


Here, $\mu$ is the overall mean level, $A_i$ is the random effect of the ith batch, and $\varepsilon_{ij}$
is the measurement error on the jth subsample from the ith batch. We assume
that the $A_i$ are independent of each other and of the measurement errors, with
$E(A_i) = 0$ and $\mathrm{Var}(A_i) = \sigma_A^2$. The $\varepsilon_{ij}$ are assumed to be independent of each
other and to have mean 0 and variance $\sigma_\varepsilon^2$. Thus,


$$\mathrm{Var}(Y_{ij}) = \sigma_A^2 + \sigma_\varepsilon^2$$


Large variability in the $Y_{ij}$ could be caused by large variability among batches,
large measurement error, or both. The former could be decreased by changing the
manufacturing process to make the batches more homogeneous, and the latter by
controlling the scoring process more carefully.


a. Show that for this model


$$E(MS_W) = \sigma_\varepsilon^2$$
$$E(MS_B) = \sigma_\varepsilon^2 + J\sigma_A^2$$


and that therefore $\sigma_A^2$ and $\sigma_\varepsilon^2$ can be estimated from the data. Calculate these
estimates.

b. Suppose that the samples had not been mixed, but that duplicate measurements
had been made on each subsample. Formulate a model that also incorporates
variability between subsamples. How could the parameters of this model be
estimated?


21. During each of four experiments on the use of carbon tetrachloride as a worm
killer, ten rats were infested with larvae (Armitage 1983). Eight days later, five 
rats were treated with carbon tetrachloride; the other five were kept as controls. 
After two more days, all the rats were killed and the numbers of worms were 
counted. The table below gives the counts of worms for the four control groups. 
Significant differences, although not expected, might be attributable to changes in 


22. 


23. 


24. 


25. 
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experimental conditions. A finding of significant differences could result in more 
carefully controlled experimentation and thus greater precision in later work. 
Use both graphical techniques and the F test to test whether there are significant 
differences among the four groups. Use a nonparametric technique as well. 





GroupI Group II Group III Group IV 
279 378 172 381 
338 275 335 346 
334 412 335 340 
198 265 282 471 
303 286 250 318 





Referring to Section 12.2, the file cablets gives the measurements on chlor- 
pheniramine maleate tablets from another manufacturer. Are there systematic 
differences between the labs? If so, which pairs differ significantly? How do 
these data compare to those given for the other manufacturer in Section 12.2? 


For a study of the release of luteinizing hormone (LH), male and female rats 
kept in constant light were compared to male and female rats in a regime of 14 h 
of light and 10 h of darkness. Various dosages of luteinizing releasing factor 
(LRF) were given: control (saline), 10, 50, 250, and 1250 ng. Levels of LH (in 
nanograms per milliliter of serum) were measured in blood samples at a later time. 
Analyze the data given in file LHfemale, LHmale to determine the effects of 
light regime and LRF on release of LH for both males and females. Use both 
graphical techniques and more formal analyses. 


A collaborative study was conducted to study the precision and homogeneity of 
a method of determining the amount of niacin in cereal products (Campbell and 
Pelletier 1962). Homogenized samples of bread and bran flakes were enriched 
with 0, 2, 4, or 8 mg of niacin per 100 g. Portions of the samples were sent 
to 12 labs, which were asked to carry out the specified procedures on each of 
three separate days. The data (in milligrams per 100 g) are given in the file 
niacin. Conduct two-way analyses of variance for both the bread and bran 
data and discuss the results. (Two data points are missing. Substitute for them 
the corresponding cell means.) 


25. This problem deals with an example from Youden (1962). An ingot of magnesium
alloy was drawn into a square rod about 100 m long with a cross section of about
4.5 cm on a side. The rod was then cut into 100 bars, each a meter long. Five of
these were selected at random, and a test piece 1.2 cm thick was cut from each.
From each of these five specimens, 10 test points were selected in a particular
geometric pattern. Two determinations of the magnesium content were made
at each test point (the analyst ran all 50 points once and then made a set of
repeat measurements). The overall purpose of the experiment was to test for
homogeneity of magnesium content in the different bars and different locations.
Analyze the data in the file magnesium (giving percentage of magnesium times
1000) to determine if there is significant variability between bars and between
locations. There are a couple of unexpected aspects of these data. Can you find
them?


26. The concentrations (in nanograms per milliliter) of plasma epinephrine were
measured for 10 dogs under isofluorane, halothane, and cyclopropane anesthesia;
the measurements are given in the following table (Perry et al. 1974).
Is there a difference in treatment effects? Use a parametric and a nonparametric 
analysis. 





               Dog 1  Dog 2  Dog 3  Dog 4  Dog 5  Dog 6  Dog 7  Dog 8  Dog 9  Dog 10
Isofluorane     .28    .51   1.00    .39    .29    .36    .32    .69    .17    .33
Halothane       .30    .39    .63    .68    .38    .21    .88    .39    .51    .32
Cyclopropane   1.07   1.35    .69    .28   1.24   1.53    .49    .56   1.02    .30





27. Three species of mice were tested for “aggressiveness.” The species were A/J,
C57, and F2 (a cross of the first two species). A mouse was placed in a 1-m²
box, which was marked off into 49 equal squares. The mouse was let go on the
center square, and the number of squares traversed in a 5-min period was counted.
Analyze the data in the files C57, AJ, and F2, using the Bonferroni method, to
determine if there is a significant difference among species.


28. Samples of each of three types of stopwatches were tested. The following table
gives thousands of cycles (on-off-restart) survived until some part of the mecha- 
nism failed (Natrella 1963). Test whether there is a significant difference among 
the types, and if there is, determine which types are significantly different. Use 
both a parametric and a nonparametric technique. 








Type I   Type II   Type III
  1.7     13.6      13.4
  1.9     19.8      20.9
  6.1     25.2      25.1
 12.5     46.2      29.7
 16.5     46.2      46.9
 25.1     61.1
 30.5
 42.1
 82.5





29. The performance of a semiconductor depends upon the thickness of a layer of
silicon dioxide. In an experiment (Czitrom and Reece, 1997), layer thicknesses
were measured at three furnace locations for three types of wafers (virgin wafers,
recycled in-house wafers, and recycled wafers from an external source). The data
are contained in the file waferlayers. Conduct a two-way analysis of variance
and test for significance of main effects and interactions. Construct a graph such
as that shown in Figure 12.3. Does the comparison of layer thicknesses depend
on furnace location?


30. Ten varieties of linseed were grown on six different plots (Adguna and
Labuschagne, 2002). The file linseed contains the yields (kg per hectare).
Can you conclude that the varieties have different yields? Use Tukey's method
to compare the varieties.


31. Problem 39 of Chapter 10 involved a table of maximum windspeeds for 35 years
at each of 21 cities. Would you expect an additive model (no interaction) to 
provide a good fit to the numbers in this table? Why or why not? Check it out. 


32. It is known that increased reproduction leads to reduced longevity for female
fruitflies. Partridge and Farquhar (1981) studied whether the same phenomenon
held for male fruitflies. The data are also discussed in Hanley and Shapiro (1994).
The experiment set up five treatment groups, each consisting of 25 randomly
assigned male fruitflies. The males in one treatment were housed with eight
virgin females per day. In another treatment, the males were housed with one
virgin female per day. There were three control groups: males housed with eight
newly impregnated females, housed with one newly impregnated female, and
housed alone. (Newly inseminated females will not usually mate within two
days.)

The data are contained in the file fruitfly, with a row for each male in the 
following format: 


Column 1: the number of females 

Column 2: the type of female—0 denoting newly pregnant, 1 denoting virgin, 
and 9 when there were no females 

Column 3: lifespan in days 

Column 4: length of thorax (mm), which is fixed at birth 

Column 5: percentage of time spent sleeping 


a. Calculate summary statistics for lifespan in each group and compare. Display 
the data in parallel boxplots. Qualitatively, what do you conclude? 

b. Do the same for percentage of time spent sleeping. 

c. Make a scatterplot of lifespan versus thorax length. Is thorax length predic- 
tive of lifespan and did the randomization balance thorax length between the 
groups? 

d. Use the F test to test for differences in longevity between the groups. Use both
Tukey's method and the Bonferroni method to compare all pairs of means. 
Summarize your conclusions. 

e. Repeat the analysis using the Kruskal-Wallis test and the Bonferroni method. 

f. How does the availability of virgin females affect the sleep of male fruitflies? 


33. How does diet affect longevity? Studies on animals have shown that restrict-
ing caloric intake can increase lifespan. Weindruch et al. did an experiment 
involving six treatment groups of female mice. The data, contained in the file 
diet-and-longevity, are also discussed in Ramsey and Shafer (2002). 
The groups were: 

1. NP: mice ate as much as they wished of a standard diet. 


2. N/N85: mice were fed normally before and after weaning. After weaning,
their caloric intake was 85 kcal per week, which is the normal average level.
3. N/R50: mice were fed normally before weaning; and after weaning, their
caloric intake was restricted to 50 kcal per week.

4. R/R50: mice were fed 50 kcal per week before and after weaning. 

5. lopro: mice were fed normally before weaning, a restricted diet of 50 kcal per 
week after weaning, and the dietary protein content decreased as they got older. 

6. N/R40: mice were fed normally before weaning and were given 40 kcal per 
week after weaning. 


In addition to parallel boxplots and an overall test for equality of means, the
scientific questions of interest call for some specific comparisons. For example,
to determine the effect of reducing caloric intake from a normal 85 kcal per week
to 50 kcal per week, groups N/N85 and N/R50 would be compared. Which groups
would you compare to answer the following questions?


a. Do preweaning dietary restrictions have an effect? 
b. Does reduction in protein have an effect? 
c. Does reduction to 40 kcal per week have an effect? 


Formulate the comparisons you wish to make and carry them out using an ap- 
propriate Bonferroni correction. What is the purpose of including the group NP? 


34. The following table gives the survival times (in hours) for animals in an
experiment whose design consisted of three poisons, four treatments, and four
observations per cell.


a. Conduct a two-way analysis of variance to test the effects of the two main 
factors and their interaction. 

b. Box and Cox (1964) analyzed the reciprocals of the data, pointing out that the 
reciprocal of a survival time can be interpreted as the rate of death. Conduct a 
two-way analysis of variance, and compare to the results of part (a). Comment 
on how well the standard two-way analysis of variance model fits and on the 
interaction in both analyses. 























                              Treatment
Poison        A             B             C             D
I         3.1   4.5     8.2  11.0     4.3   4.5     4.5   7.1
          4.6   4.3     8.8   7.2     6.3   7.6     6.6   6.2
II        3.6   2.9     9.2   6.1     4.4   3.5     5.6  10.0
          4.0   2.3     4.9  12.4     3.1   4.0     7.1   3.8
III       2.2   2.1     3.0   3.7     2.3   2.5     3.0   3.6
          1.8   2.3     3.8   2.9     2.4   2.2     3.1   3.3


35. The concentration of follicle stimulating hormone (FSH) can be measured through
a bioassay. The basic idea is that when FSH is added to a certain culture, a
proportional amount of estrogen is produced; hence, after calibration, the amount of
FSH can be found by measuring estrogen production. However, determining FSH
levels in serum samples is difficult because some factor(s) in the serum inhibit
estrogen production and thus interfere with the bioassay. An experiment was done to
see if it would be effective to pretreat the serum with polyethyleneglycol (PEG)
which, it was hoped, precipitates some of the inhibitory substance(s).
Three treatments were applied to prepared cultures: no serum, PEG-treated
FSH-free serum, and untreated FSH-free serum. Each culture had one of eight
doses of FSH: 4, 2, 1, .5, .25, .125, .06, or 0.0 mIU/4. For each serum-dose
combination, there were three cultures, and after incubation for three days, each 
culture was assayed for estrogen by radioimmunoassay. The table that follows 
gives the results (units are nanograms of estrogen per milliliter). Analyze these 
data with a view to determining to what degree PEG treatment is successful in 
removing the inhibitory substances from the serum. Write a brief report summa- 
rizing and documenting your conclusions. 








Dose    No Serum    PEG Serum    Untreated Serum
 .00     1,814.4       372.7         1,745.3
 .00     3,043.2       350.1         2,470.0
 .00     3,857.1       426.0         1,700.0
 .06     2,447.9       628.3         1,919.2
 .06     3,320.9       655.0         1,605.1
 .06     3,387.6       700.0         2,796.0
 .12     4,887.8     1,701.8         1,929.7
 .12     5,171.2     2,589.4         1,537.3
 .12     3,370.7     1,117.1         1,692.7
 .25    10,255.6     4,114.6         1,149.1
 .25     9,431.8     2,761.5           743.4
 .25    10,961.2     1,975.8           948.5
 .50    14,538.8     6,074.3         4,471.9
 .50    14,214.3    12,273.9         2,172.1
 .50    16,934.5    14,240.9         5,782.3
1.00    19,719.8    17,889.9        11,588.7
1.00    20,801.4    11,685.7         8,249.5
1.00    32,740.7    11,342.4        18,481.5
2.00    16,453.8    11,843.5        10,433.5
2.00    28,793.8    18,320.7         8,181.0
2.00    19,148.5    23,580.6        11,104.0
4.00    17,967.0    12,380.0        10,020.0
4.00    18,768.6    20,039.0         8,448.5
4.00    19,946.9    15,135.6        10,482.8


CHAPTER 13


The Analysis of 
Categorical Data 


Introduction 


This chapter introduces the analysis of data that are in the form of counts in various 
categories. We will deal primarily with two-way tables, the rows and columns of 
which represent categories. Suppose that the rows of such a table represent various 
hair colors and its columns various eye colors and that each cell contains a count 
of the number of people who fall in that particular cross-classification. We might be 
interested in dependencies between the row and column classifications—that is, is 
hair color related to eye color? 

We emphasize that the data considered in this chapter are counts, rather than 
continuous measurements as they were in Chapter 12. Thus, in this chapter, we will 
make heavy use of the multinomial and chi-square distributions. 


Fisher's Exact Test 


We will develop Fisher's exact test in the context of the following example. Rosen and 
Jerdee (1974) conducted several experiments, using as subjects male bank supervisors 
attending a management institute. As part of their training, the supervisors had to make 
decisions on items in an in-basket. The investigators embedded their experimental 
materials in the contents of the in-baskets. In one experiment, the supervisors were 
given a personnel file and had to decide whether to promote the employee or to hold the 
file and interview additional candidates. By random selection, 24 of the supervisors 
examined a file labeled as being that of a male employee and 24 examined a file 
labeled as being that of a female employee; the files were otherwise identical. The 
results are summarized in the following table:

             Male   Female
Promote       21      14
Hold File      3      10


From the results, it appears that there is a sex bias—21 of 24 males were promoted, 
but only 14 of 24 females were. But someone who was arguing against the presence of 
sex bias could claim that the results occurred by chance; that is, even if there were no 
bias and the supervisors were completely indifferent to sex, discrepancies like those 
observed could occur with fairly large probability by chance alone. To rephrase this 
argument, the claim is that 35 of the 48 supervisors chose to promote the employee 
and 13 chose not to, and that 21 of the 35 promotions were of a male employee merely 
because of the random assignment of supervisors to male and female files. 

The strength of the argument against sex bias must be assessed by a calculation 
of probability. If it is likely that the randomization could result in such an imbalance, 
the argument is difficult to refute; however, if only a small proportion of all possible 
randomizations would give such an imbalance, the argument has less force. We take 
as the null hypothesis that there is no sex bias and that any differences observed are 
due to the randomization. We denote the counts in the table and on the margins as 
follows: 


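$$
\begin{array}{cc|c}
N_{11} & N_{12} & n_{1.} \\
N_{21} & N_{22} & n_{2.} \\
\hline
n_{.1} & n_{.2} & n_{..}
\end{array}
$$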














According to the null hypothesis, the margins of the table are fixed: There are 24 
females, 24 males, 35 supervisors who choose to promote, and 13 who choose not 
to. Also, the process of randomization determines the counts in the interior of the 
table (denoted by capital letters since they are random) subject to the constraints of 
the margins. With these constraints, there is only 1 degree of freedom in the interior 
of the table; if any interior count is fixed, the others may be determined. 

Consider the count $N_{11}$, the number of males who are promoted. Under the null
hypothesis, the distribution of $N_{11}$ is that of the number of successes in 24 draws
without replacement from a population of 35 successes and 13 failures; that is, the
distribution of $N_{11}$ induced by the randomization is hypergeometric. The probability
that $N_{11} = n_{11}$ is

$$
p(n_{11}) = \frac{\binom{n_{1.}}{n_{11}} \binom{n_{2.}}{n_{21}}}{\binom{n_{..}}{n_{.1}}}
$$

We will use $N_{11}$ as the test statistic for testing the null hypothesis. The preceding
hypergeometric probability distribution is the null distribution of $N_{11}$ and is tabled
here. A two-sided test rejects for extreme values of $N_{11}$.

n11   p(n11)
11    .000
12    .000
13    .004
14    .021
15    .072
16    .162
17    .241
18    .241
19    .162
20    .072
21    .021
22    .004
23    .000
24    .000


From this table, a rejection region for a two-sided test with $\alpha = .05$ consists of the
following values for $N_{11}$: 11, 12, 13, 14, 21, 22, 23, and 24. The observed value of $N_{11}$,
21, falls in this region, so the test would reject at level .05. An imbalance in promotions
as extreme as or more extreme than that observed would occur by chance alone with
probability only .05, so there is fairly strong evidence of gender bias.
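The tabled null distribution can be reproduced directly from the hypergeometric formula. The following is a minimal sketch using only Python's standard library; the function name `hypergeom_pmf` and the convention for the two-sided p-value (summing the probabilities of all outcomes no more likely than the observed one) are choices made for this illustration, not something prescribed by the text.

```python
from math import comb

def hypergeom_pmf(k, successes, failures, draws):
    """P(exactly k successes in `draws` draws without replacement)."""
    if k < 0 or k > successes or draws - k < 0 or draws - k > failures:
        return 0.0
    return comb(successes, k) * comb(failures, draws - k) / comb(successes + failures, draws)

# Margins of the promotion table: 35 "promote", 13 "hold file", 24 male files.
PROMOTE, HOLD, MALE_FILES = 35, 13, 24
OBSERVED_N11 = 21

# N11 can range from 24 - 13 = 11 up to 24.
null_dist = {k: hypergeom_pmf(k, PROMOTE, HOLD, MALE_FILES) for k in range(11, 25)}

# Two-sided p-value (assumed convention): add up every outcome that is
# no more probable than the observed one.
p_obs = null_dist[OBSERVED_N11]
p_two_sided = sum(p for p in null_dist.values() if p <= p_obs * (1 + 1e-9))

for k, p in null_dist.items():
    print(f"{k:2d}  {p:.3f}")
print(f"two-sided p-value: {p_two_sided:.3f}")
```

Because the two column totals are equal, the null distribution is symmetric, so this convention coincides with doubling the upper tail.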


The Chi-Square Test of Homogeneity 


Suppose that we have independent observations from J multinomial distributions,
each of which has I cells, and that we want to test whether the cell probabilities
of the multinomials are equal, that is, to test the homogeneity of the multinomial
distributions.

As an example, we will consider a quantitative study of an aspect of literary 
style. Several investigators have used probabilistic models of word counts as indices 
of literary style, and statistical techniques applied to such counts have been used in 
controversies about disputed authorship. An interesting account is given by Morton 
(1978), from whom we take the following example. 

When Jane Austen died, she left the novel Sanditon only partially completed, 
but she left a summary of the remainder. A highly literate admirer finished the novel,
attempting to emulate Austen's style, and the hybrid was published. Morton counted
the occurrences of various words in several works: Chapters 1 and 3 of Sense and 
Sensibility, Chapters 1, 2, and 3 of Emma, Chapters 1 and 6 of Sanditon (written by 
Austen); and Chapters 12 and 24 of Sanditon (written by her admirer). The counts 
Morton obtained for six words are given in the following table: 


























Word       Sense and Sensibility   Emma   Sanditon I   Sanditon II
a                  147              186       101           83
an                  25               26        11           29
this                32               39        15           15
that                94              105        37           22
with                59               74        28           43
without             18               10        10            4
Total              375              440       202          196














We will compare the relative frequencies with which these words appear and will 
examine the consistency of Austen’s usage of them from book to book and the degree 
to which her admirer was successful in imitating this aspect of her style. A stochastic 
model will be used for this purpose: The six counts for Sense and Sensibility will 
be modeled as a realization of a multinomial random variable with unknown cell 
probabilities and total count 375; the counts for the other works will be similarly 
modeled as independent multinomial random variables. 

Thus, we must consider comparing J multinomial distributions each having I
categories. If the probability of the ith category of the jth multinomial is denoted $\pi_{ij}$,
the null hypothesis to be tested is

$$
H_0: \pi_{i1} = \pi_{i2} = \cdots = \pi_{iJ}, \qquad i = 1, \ldots, I
$$


We may view this as a goodness-of-fit test: Does the model prescribed by the null 
hypothesis fit the data? To test goodness of fit, we will compare observed values with 
expected values as in Chapter 9, using likelihood ratio statistics or Pearson’s chi- 
square statistic. We will assume that the data consist of independent samples from 
each multinomial distribution, and we will denote the count in the ith category of the 
jth multinomial as $n_{ij}$.

Under $H_0$, each of the J multinomials has the same probability for the ith category,
say $\pi_i$. The following theorem shows that the mle of $\pi_i$ is simply $n_{i.}/n_{..}$, which
is an obvious estimate. Here, $n_{i.}$ is the total count in the ith category, $n_{..}$ is the grand
total count, and $n_{.j}$ is the total count for the jth multinomial.

THEOREM A
Under $H_0$, the mle's of the parameters $\pi_1, \pi_2, \ldots, \pi_I$ are

$$
\hat{\pi}_i = \frac{n_{i.}}{n_{..}}, \qquad i = 1, \ldots, I
$$

where $n_{i.}$ is the total number of responses in the ith category and $n_{..}$ is the grand
total number of responses.


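Proof

Since the multinomial distributions are independent,

$$
\begin{aligned}
\operatorname{lik}(\pi_1, \pi_2, \ldots, \pi_I)
&= \prod_{j=1}^{J} \binom{n_{.j}}{n_{1j}\, n_{2j} \cdots n_{Ij}} \pi_1^{n_{1j}} \pi_2^{n_{2j}} \cdots \pi_I^{n_{Ij}} \\
&= \pi_1^{n_{1.}} \pi_2^{n_{2.}} \cdots \pi_I^{n_{I.}} \prod_{j=1}^{J} \binom{n_{.j}}{n_{1j}\, n_{2j} \cdots n_{Ij}}
\end{aligned}
$$

Let us consider maximizing the log likelihood subject to the constraint
$\sum_{i=1}^{I} \pi_i = 1$. Introducing a Lagrange multiplier, we have to maximize

$$
l(\pi, \lambda) = \sum_{j=1}^{J} \log \binom{n_{.j}}{n_{1j}\, n_{2j} \cdots n_{Ij}}
+ \sum_{i=1}^{I} n_{i.} \log \pi_i
+ \lambda \left( \sum_{i=1}^{I} \pi_i - 1 \right)
$$

Now,

$$
\frac{\partial l}{\partial \pi_i} = \frac{n_{i.}}{\pi_i} + \lambda, \qquad i = 1, \ldots, I
$$

and setting these derivatives equal to zero gives

$$
\hat{\pi}_i = -\frac{n_{i.}}{\lambda}
$$

Summing over both sides and applying the constraint, we find $\lambda = -n_{..}$ and
$\hat{\pi}_i = n_{i.}/n_{..}$, as was to be proved.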


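For the jth multinomial, the expected count in the ith category is the estimated
probability of that cell times the total number of observations for the jth multinomial,
or

$$
E_{ij} = \frac{n_{i.}\, n_{.j}}{n_{..}}
$$

Pearson's chi-square statistic is therefore

$$
X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
    = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - n_{i.} n_{.j}/n_{..})^2}{n_{i.} n_{.j}/n_{..}}
$$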


For large sample sizes, the approximate null distribution of this statistic is chi-square.
(The usual recommendation concerning the sample size necessary for this approximation
to be reasonable is that the expected counts should all be greater than 5.)
The degrees of freedom are the number of independent counts minus the number of
independent parameters estimated from the data. Each multinomial has $I - 1$ independent
counts, since the totals are fixed, and $I - 1$ independent parameters have
been estimated. The degrees of freedom are therefore

$$
df = J(I - 1) - (I - 1) = (I - 1)(J - 1)
$$


We now apply this method to the word counts from Austen's works. First, we 
consider Austen's consistency from one work to another. The following table gives 
the observed count and, below it, the expected count in each cell of the table. 























Word       Sense and Sensibility   Emma    Sanditon I
a                 147               186       101
                  160.0             187.8      86.2
an                 25                26        11
                   22.9              26.8      12.3
this               32                39        15
                   31.7              37.2      17.1
that               94               105        37
                   87.0             102.1      46.9
with               59                74        28
                   59.4              69.7      32.0
without            18                10        10
                   14.0              16.4       7.5











The observed counts look fairly close to the expected counts, and the chi-square
statistic is 12.27. The 10% point of the chi-square distribution with 10 degrees of
freedom is 15.99, and the 25% point is 12.54. The data are thus consistent with the
model that the word counts in the three works are realizations of multinomial random
variables with the same underlying probabilities. The data give no evidence that the
relative frequencies with which Austen used these words changed from work to work.
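The calculation for the three Austen works can be checked with a short script. This is a sketch using only the standard library; the function names are hypothetical, and the tail probability uses a closed form for the chi-square distribution that is valid only when the degrees of freedom are even (which happens to hold here, with 10 degrees of freedom).

```python
from math import exp, factorial

def pearson_chi_square(table):
    """Return (statistic, df, expected counts) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    # Expected count in cell (i, j): n_i. * n_.j / n_..
    expected = [[r * c / grand for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

def chi2_upper_tail_even_df(x, df):
    """P(chi-square_df > x); this closed form holds only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires even degrees of freedom")
    half = x / 2
    return exp(-half) * sum(half ** i / factorial(i) for i in range(df // 2))

# Rows: a, an, this, that, with, without.
# Columns: Sense and Sensibility, Emma, Sanditon I.
austen = [
    [147, 186, 101],
    [25, 26, 11],
    [32, 39, 15],
    [94, 105, 37],
    [59, 74, 28],
    [18, 10, 10],
]

stat, df, expected = pearson_chi_square(austen)
p_value = chi2_upper_tail_even_df(stat, df)
print(f"X^2 = {stat:.2f}, df = {df}, p-value = {p_value:.3f}")
```

The p-value falls above .25, in agreement with the comparison of 12.27 to the tabled 25% point.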

To compare Austen and her imitator, we can pool all Austen's work together 
in light of the above findings. The following table shows the observed and expected 
frequencies for the imitator and Austen:

Word       Imitator   Austen
a             83        434
              83.5      433.5
an            29         62
              14.7       76.3
this          15         86
              16.3       84.7
that          22        236
              41.7      216.3
with          43        161
              33.0      171.0
without        4         38
               6.8       35.2


The chi-square statistic is 32.81 with 5 degrees of freedom, giving a p-value of 
less than .001. The imitator was not successful in imitating this aspect of Austen's 
style. To see which discrepancies are large, it is helpful to examine the contributions 
to the chi-square statistic cell by cell, as tabulated here: 


























Word Imitator Austen 
a 0.00 0.00 
an 13.90 2.68 
this 0.11 0.02 
that 9.30 1.79 
with 3.06 0.59 
without 1.14 0.22 


Inspecting this and the preceding table, we see that the relative frequency with which 
Austen used the word an was much smaller than that with which her imitator used it, 
and that the relative frequency with which she used that was much larger. 
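The cell-by-cell contributions tabulated above are simply the individual terms (O - E)^2 / E of Pearson's statistic. As an illustrative sketch (variable names are hypothetical), they can be recomputed from the observed counts as follows.

```python
words = ["a", "an", "this", "that", "with", "without"]
imitator = [83, 29, 15, 22, 43, 4]
austen = [434, 62, 86, 236, 161, 38]  # the three Austen works pooled

col_totals = (sum(imitator), sum(austen))
grand = sum(col_totals)

contributions = {}
for word, obs_pair in zip(words, zip(imitator, austen)):
    row_total = sum(obs_pair)
    cells = []
    for observed, col_total in zip(obs_pair, col_totals):
        expected = row_total * col_total / grand
        cells.append((observed - expected) ** 2 / expected)
    contributions[word] = tuple(cells)

chi_square = sum(sum(cells) for cells in contributions.values())

for word, (c_imitator, c_austen) in contributions.items():
    print(f"{word:8s} {c_imitator:6.2f} {c_austen:6.2f}")
print(f"X^2 = {chi_square:.2f}")
```

The words an and that account for most of the statistic, which is the pattern described in the text.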


The Chi-Square Test of Independence 


This section develops a chi-square test that is very similar to the one of the preceding 
section but is aimed at answering a slightly different question. We will again use an
example.

In a demographic study of women who were listed in Who's Who, Kiser and
Schaefer (1949) compiled the following table for 1436 women who were married at 
least once: 











Education Married Once Married More Than Once Total 
College 550 61 611 
No College 681 144 825 
Total 1231 205 1436 











Is there a relationship between marital status and educational level? Of the women
who had a college degree, $\frac{61}{611} = 10\%$ were married more than once; of those who
had no college degree, $\frac{144}{825} = 17\%$ were married more than once. Alternatively, the
question might be addressed by noting that of those women who were married more
than once, $\frac{61}{205} = 30\%$ had a college degree, whereas of those married only once,
$\frac{550}{1231} = 45\%$ had a college degree. For this sample of 1436 women, having a college
degree is positively associated with being married only once, but it is impossible
to make causal inferences from the data. Marital stability could be influenced by
educational level, or both characteristics could be influenced by other factors, such
as social class.

A critic of the study could in any case claim that the relationship between marital 
status and educational level is "statistically insignificant." Since the data are not a 
sample from any population, and since no randomization has been performed, the 
role of probability and statistics is not clear. One might respond to such a criticism 
by saying that the data speak for themselves and that there is no chance mechanism 
at work. The critic might then rephrase his argument: “If I were to take a sample of 
1436 people cross-classified into two categories which were in fact unrelated in the 
population from which the sample was drawn, I might find associations as strong or 
stronger than those observed in this table. Why should I believe that there is any real 
association in your table?" Even though this argument may not seem compelling, 
statistical tests are often carried out in situations in which stochastic mechanisms are 
at best hypothetical. 

We will discuss statistical analysis of a sample of size n cross-classified in a
table with I rows and J columns. Such a configuration is called a contingency table.
The joint distribution of the counts $n_{ij}$, where $i = 1, \ldots, I$ and $j = 1, \ldots, J$, is
multinomial with cell probabilities denoted as $\pi_{ij}$. Let

$$
\pi_{i.} = \sum_{j=1}^{J} \pi_{ij}
\qquad
\pi_{.j} = \sum_{i=1}^{I} \pi_{ij}
$$

denote the marginal probabilities that an observation will fall in the ith row and
jth column, respectively. If the row and column classifications are independent of
each other,

$$
\pi_{ij} = \pi_{i.}\, \pi_{.j}
$$
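We thus consider testing the following null hypothesis:

$$
H_0: \pi_{ij} = \pi_{i.}\, \pi_{.j}, \qquad i = 1, \ldots, I, \quad j = 1, \ldots, J
$$

versus the alternative that the $\pi_{ij}$ are free. Under $H_0$, the mle of $\pi_{ij}$ is

$$
\hat{\pi}_{ij} = \hat{\pi}_{i.}\, \hat{\pi}_{.j} = \frac{n_{i.}}{n} \times \frac{n_{.j}}{n}
$$

(see Problem 10 at the end of this chapter). Under the alternative, the mle of $\pi_{ij}$ is
simply

$$
\tilde{\pi}_{ij} = \frac{n_{ij}}{n}
$$

These estimates can be used to form a likelihood ratio test or an asymptotically
equivalent Pearson's chi-square test,

$$
X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$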


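Here the $O_{ij}$ are the observed counts ($n_{ij}$). The expected counts, the $E_{ij}$, are the fitted
counts:

$$
E_{ij} = n \hat{\pi}_{ij} = \frac{n_{i.}\, n_{.j}}{n}
$$

Pearson’s chi-square statistic is, therefore,

$$
X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - n_{i.} n_{.j}/n)^2}{n_{i.} n_{.j}/n}
$$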


The degrees of freedom for the chi-square statistic are calculated as in Section 9.5.
Under $\Omega$, the cell probabilities sum to 1 but are otherwise free, and there
are thus $IJ - 1$ independent parameters. Under the null hypothesis, the marginal
probabilities are estimated from the data and are specified by $(I - 1) + (J - 1)$
independent parameters. Thus,

$$
df = IJ - 1 - (I - 1) - (J - 1) = (I - 1)(J - 1)
$$





Returning to the data on 1436 women from the demographic study, we calculate 
expected values and construct the following table:

Education     Married Once   More Than Once
College           550              61
                  523.8            87.2
No College        681             144
                  707.2           117.8








The chi-square statistic is 16.01 with 1 degree of freedom, giving a p-value less than 
.001. We would reject the hypothesis of independence and conclude that there is a 
relationship between marital status and educational level. 
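As a check on the arithmetic, the expected counts and the statistic for the marital-status table can be recomputed in a few lines. This is only a sketch: the function name is hypothetical, and the p-value relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so its upper tail can be written with the complementary error function.

```python
from math import erfc, sqrt

def chi_square_independence(table):
    """Pearson statistic and fitted counts n_i. * n_.j / n for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    return stat, expected

# Rows: College, No College. Columns: Married Once, Married More Than Once.
marital = [[550, 61],
           [681, 144]]

stat, expected = chi_square_independence(marital)

# Valid for 1 degree of freedom only: P(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(stat / 2))

print(f"X^2 = {stat:.2f}, p-value = {p_value:.1e}")
```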

The chi-square statistic used here to test independence is identical in form and 
degrees of freedom to that used in the preceding section to test homogeneity; how- 
ever, the hypotheses are different and the sampling schemes are different. The test 
of homogeneity was derived under the assumption that the column (or row) margins 
were fixed, and the test of independence was derived under the assumption that only 
the grand total was fixed. Because the test statistics are computed in an identical 
fashion and have the same number of degrees of freedom, the distinction between 
them is often slurred over. Furthermore, the notions of homogeneity and indepen- 
dence are closely related and easily confused. Independence can be thought of as 
homogeneity of conditional distributions; for example, if education level and marital
status are independent, then the conditional probabilities of marital status given
educational level are homogeneous: P(Married Once | College) = P(Married Once |
No College).


Matched-Pairs Designs 


Matched-pairs designs can be effective for experiments involving categorical data; 
as with experiments involving continuous data, pairing can control for extraneous 
sources of variability and can increase the power of a statistical test. Appropriate 
techniques, however, must be used in the analysis of the data. This section begins 
with an extended example illustrating these concepts. 


Vianna, Greenwald, and Davies (1971) collected data comparing the percentages 
of tonsillectomies for a group of patients suffering from Hodgkin's disease and a 
comparable control group: 





             Tonsillectomy   No Tonsillectomy
Hodgkin's         67                34
Control           43                64


The table shows that 66% of the Hodgkin's sufferers had had a tonsillectomy,
compared to 40% of the control group. The chi-square test for homogeneity gives a
chi-square statistic of 14.26 with 1 degree of freedom, which is highly significant.
The investigators conjectured that the tonsils act as a protective barrier in some fashion
against Hodgkin's disease.

Johnson and Johnson (1972) selected 85 Hodgkin's patients who had a sibling 
of the same sex who was free of the disease and whose age was within 5 years of the 
patient's. These investigators presented the following table: 





             Tonsillectomy   No Tonsillectomy
Hodgkin's         41                44
Control           33                52


They calculated a chi-square statistic of 1.53, which is not significant. Their findings 
thus appeared to be at odds with those of Vianna, Greenwald, and Davies. 

Several letters to the editor of the journal that published Johnson and Johnson’s 
results pointed out that those investigators had made an error in their analysis by 
ignoring the pairings. The assumption behind the chi-square test of homogeneity 
is that independent multinomial samples are compared, and Johnson and Johnson’s 
samples were not independent, because siblings were paired. An appropriate analysis 
of Johnson and Johnson’s data is suggested once we set up a table that exhibits the 
pairings: 

















                                     Sibling
Patient               No Tonsillectomy   Tonsillectomy
No Tonsillectomy             37                 7
Tonsillectomy                15                26





Viewed in this way, the data are a sample of size 85 from a multinomial distribution 
with four cells. We can represent the probabilities in the table as follows: 








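$$
\begin{array}{cc|c}
\pi_{11} & \pi_{12} & \pi_{1.} \\
\pi_{21} & \pi_{22} & \pi_{2.} \\
\hline
\pi_{.1} & \pi_{.2} & 1
\end{array}
$$

The appropriate null hypothesis states that the probabilities of tonsillectomy
and no tonsillectomy are the same for patients and siblings, that is, $\pi_{1.} = \pi_{.1}$ and
$\pi_{2.} = \pi_{.2}$, or

$$
\begin{aligned}
\pi_{11} + \pi_{12} &= \pi_{11} + \pi_{21} \\
\pi_{12} + \pi_{22} &= \pi_{21} + \pi_{22}
\end{aligned}
$$

These equations simplify to $\pi_{12} = \pi_{21}$.
The relevant null hypothesis is thus

$$
H_0: \pi_{12} = \pi_{21}
$$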


Under the null hypothesis, the off-diagonal probabilities are equal, and under the 
alternative they are not. The diagonal probabilities do not distinguish the null and 
alternative hypotheses. We will derive a test, called McNemar's test, of this hypoth- 
esis. Under Hp, the mle's of the cell probabilities are (see Problem 10 at the end of 
this chapter) 


ni 


T = 
n 
x n2 
T22 = — 
n 
R A ni» + nj 
Miz = Ha — — —— —— 
2n 


The contributions to the chi-square statistic from the nı; and nz cells are equal to 
zero; the remainder of the statistic is 


= [nio — (ni2 d n2)/2P. [nza — Q2 +1021) /2P 
(ni2 + n31)/2 (ni2 + na )/2 


2 Q2 — n3)? 


nj» + n» 


x? 





Let us count degrees of freedom: Under $\Omega$ there are three free parameters since
there are four cell probabilities which are constrained to sum to 1. Under the null
hypothesis, there is the additional constraint, $\pi_{12} = \pi_{21}$, and there are two
free parameters. The chi-square statistic thus has 1 degree of freedom. For the data
table exhibiting the pairings, $X^2 = 2.91$, with a corresponding p-value of .09. This
casts doubt on the null hypothesis, contrary to Johnson and Johnson's original analysis.
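
The calculation can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `mcnemar` is illustrative, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

def mcnemar(n12, n21):
    """Return McNemar's statistic (n12 - n21)^2 / (n12 + n21) and its
    approximate p-value from the chi-square distribution with 1 df."""
    stat = (n12 - n21) ** 2 / (n12 + n21)
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Off-diagonal counts from the table exhibiting the pairings.
stat, p_value = mcnemar(7, 15)
print(f"X^2 = {stat:.2f}, p-value = {p_value:.2f}")
```

Only the off-diagonal counts enter the function, which mirrors the observation that the diagonal cells contribute nothing to the statistic.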


Cell Phones and Driving 

Does the use of cell phones while driving cause accidents? This is a difficult question 
to study empirically. An observational study comparing accident rates of users and 
nonusers would be subject to numerous sources of confounding, such as age, gender, 
and time and place of driving. A randomized, controlled experiment in which drivers 
were randomly assigned to use or not use cell phones is infeasible, partly because it 
would be unethical to deliberately expose people to a potentially hazardous condition. 
Double blinding would clearly be impossible. Redelmeier and Tibshirani (1997) con- 
ducted a clever study, designed in the following way. They identified 699 drivers who 
owned cell phones and who had been involved in motor vehicle collisions. They then 
used billing records to determine whether each individual used a cell phone during 
the 10 minutes preceding the collision and also at the same time during the previous 
week. (For more details, see the cited paper.) Each person thus served as his own 
control, eliminating various sources of confounding. The results are laid out in the
following table:





                          Before Collision
Collision         On Phone   Not on Phone   Total
On Phone             13          157         170
Not on Phone         24          505         529
Total                37          662         699




















From the table, on the day of the collision, 24% of the drivers had been on the
phone as compared to 5% the day before the collision. McNemar's test can be applied
to test the null hypothesis of no association:

$$X^2 = \frac{(157 - 24)^2}{157 + 24} = 97.7$$


There is thus no doubt that the association is statistically significant. However, the 
authors pointed out that this result does not necessarily imply that cell phone use 
while driving causes more accidents—for example, it is possible that during times 
of emotional stress, drivers are more likely to use cell phones, and because of the 
emotional stress are also less attentive to their driving.


13.6 Odds Ratios


If an event A has probability P(A) of occurring, the odds of A occurring are defined
to be

$$\operatorname{odds}(A) = \frac{P(A)}{1 - P(A)}$$

Since this implies that

$$P(A) = \frac{\operatorname{odds}(A)}{1 + \operatorname{odds}(A)}$$

odds of 2 (or 2 to 1), for example, correspond to P(A) = 2/3.
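
As a small illustration of the two formulas above, here is a Python sketch of the conversion in both directions. The function names are illustrative and not part of the text.

```python
def odds_from_prob(p):
    """odds(A) = P(A) / (1 - P(A)), for 0 <= p < 1."""
    return p / (1 - p)

def prob_from_odds(odds):
    """P(A) = odds(A) / (1 + odds(A))."""
    return odds / (1 + odds)

# Odds of 2 (2 to 1) correspond to a probability of 2/3, and conversely.
p = prob_from_odds(2)
back = odds_from_prob(p)
print(p, back)
```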

Now suppose that X denotes the event that an individual is exposed to a potentially
harmful agent and that D denotes the event that the individual becomes diseased. We
denote the complementary events as $\bar{X}$ and $\bar{D}$. The odds of an individual
contracting the disease given that he is exposed are

$$\operatorname{odds}(D \mid X) = \frac{P(D \mid X)}{1 - P(D \mid X)}$$

and the odds of contracting the disease given that he is not exposed are

$$\operatorname{odds}(D \mid \bar{X}) = \frac{P(D \mid \bar{X})}{1 - P(D \mid \bar{X})}$$

The odds ratio

$$\Delta = \frac{\operatorname{odds}(D \mid X)}{\operatorname{odds}(D \mid \bar{X})}$$

is a measure of the influence of exposure on subsequent disease.
We will consider how the odds and odds ratio could be estimated by sampling 
from a population with joint and marginal probabilities defined as in the following 
table: 














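$$\begin{array}{c|cc|c}
 & \bar{D} & D & \\
\hline
\bar{X} & \pi_{00} & \pi_{01} & \pi_{0\cdot} \\
X & \pi_{10} & \pi_{11} & \pi_{1\cdot} \\
\hline
 & \pi_{\cdot 0} & \pi_{\cdot 1} & 1
\end{array}$$

With this notation,

$$P(D \mid X) = \frac{\pi_{11}}{\pi_{10} + \pi_{11}}$$
$$P(D \mid \bar{X}) = \frac{\pi_{01}}{\pi_{00} + \pi_{01}}$$

so that

$$\operatorname{odds}(D \mid X) = \frac{\pi_{11}}{\pi_{10}}$$
$$\operatorname{odds}(D \mid \bar{X}) = \frac{\pi_{01}}{\pi_{00}}$$

and the odds ratio is

$$\Delta = \frac{\pi_{11}\pi_{00}}{\pi_{01}\pi_{10}}$$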


the product of the diagonal probabilities in the preceding table divided by the product 
of the off-diagonal probabilities. 

Now we will consider three possible ways to sample this population to study 
the relationship of disease and exposure. First, we might consider drawing a random 
sample from the entire population; from such a sample we could estimate all the 
probabilities directly. However, if the disease is rare, the total sample size would have 
to be quite large to guarantee that a substantial number of diseased individuals was 
included. 

A second method of sampling is called a prospective study: a fixed number of
exposed and nonexposed individuals are sampled, and the incidences of disease in
those two groups are compared. In this case the data allow us to estimate and compare
$P(D \mid X)$ and $P(D \mid \bar{X})$ and, hence, the odds ratio. For example,
$P(D \mid X)$ would be estimated by the proportion of exposed individuals who had the
disease. However, note that the individual probabilities $\pi_{ij}$ cannot be estimated
from the data, because the marginal counts of exposed and unexposed individuals have
been fixed arbitrarily by the sampling design.

A third method of sampling, a retrospective study, is one in which a fixed
number of diseased and undiseased individuals are sampled and the incidences of
exposure in the two groups are compared. The study of Vianna, Greenwald, and
Davies (1971) discussed in the previous section was of this type. From such data,
we can directly estimate $P(X \mid D)$ and $P(X \mid \bar{D})$ by the proportions of
diseased and nondiseased individuals who were exposed. Because the marginal counts of
diseased and nondiseased are fixed, we cannot estimate the joint probabilities or the
important conditional probabilities $P(D \mid X)$ and $P(D \mid \bar{X})$. However, as
will be shown, we can estimate the odds ratio $\Delta$. Observe that

$$P(X \mid D) = \frac{\pi_{11}}{\pi_{01} + \pi_{11}}$$
$$1 - P(X \mid D) = \frac{\pi_{01}}{\pi_{01} + \pi_{11}}$$
$$\operatorname{odds}(X \mid D) = \frac{\pi_{11}}{\pi_{01}}$$

Similarly,

$$\operatorname{odds}(X \mid \bar{D}) = \frac{\pi_{10}}{\pi_{00}}$$

We thus see that the odds ratio, $\Delta$, defined previously, can also be expressed as

$$\Delta = \frac{\operatorname{odds}(X \mid D)}{\operatorname{odds}(X \mid \bar{D})}$$
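
The equality of the two expressions for the odds ratio can be checked numerically. The sketch below is in Python and uses a made-up table of joint probabilities (the values 0.60, 0.10, 0.20, 0.10 are hypothetical, chosen only for illustration); `pi[x][d]` is the probability of exposure status x and disease status d, with 1 meaning exposed or diseased.

```python
def odds(p):
    return p / (1 - p)

# Hypothetical joint probabilities pi[x][d]; they sum to 1.
pi = [[0.60, 0.10],   # not exposed: (not diseased, diseased)
      [0.20, 0.10]]   # exposed:     (not diseased, diseased)

# Prospective form: odds of disease given exposure status.
p_d_given_x = pi[1][1] / (pi[1][0] + pi[1][1])
p_d_given_notx = pi[0][1] / (pi[0][0] + pi[0][1])
ratio_prospective = odds(p_d_given_x) / odds(p_d_given_notx)

# Retrospective form: odds of exposure given disease status.
p_x_given_d = pi[1][1] / (pi[0][1] + pi[1][1])
p_x_given_notd = pi[1][0] / (pi[0][0] + pi[1][0])
ratio_retrospective = odds(p_x_given_d) / odds(p_x_given_notd)

# Cross-product form.
cross_product = pi[1][1] * pi[0][0] / (pi[0][1] * pi[1][0])

print(ratio_prospective, ratio_retrospective, cross_product)
```

All three quantities agree up to rounding error, which is why a retrospective design can estimate the odds ratio even though it cannot estimate $P(D \mid X)$ itself.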
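Specifically, suppose that the counts in such a study are denoted as in the following
table, which has the same layout as the preceding table of probabilities:

$$\begin{array}{c|cc|c}
 & \bar{D} & D & \\
\hline
\bar{X} & n_{00} & n_{01} & n_{0\cdot} \\
X & n_{10} & n_{11} & n_{1\cdot} \\
\hline
 & n_{\cdot 0} & n_{\cdot 1} & n_{\cdot\cdot}
\end{array}$$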








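Then the conditional probabilities and the odds ratios are estimated as

$$\hat{P}(X \mid D) = \frac{n_{11}}{n_{\cdot 1}}$$
$$1 - \hat{P}(X \mid D) = \frac{n_{01}}{n_{\cdot 1}}$$
$$\widehat{\operatorname{odds}}(X \mid D) = \frac{n_{11}}{n_{01}}$$

Similarly,

$$\widehat{\operatorname{odds}}(X \mid \bar{D}) = \frac{n_{10}}{n_{00}}$$

so that the estimate of the odds ratio is

$$\hat\Delta = \frac{n_{00}n_{11}}{n_{01}n_{10}}$$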


the product of the diagonal counts divided by the product of the off-diagonal counts. 
As an example, consider the table in the previous section that displays the data
of Vianna, Greenwald, and Davies. The odds ratio is estimated to be

$$\hat\Delta = \frac{67 \times 64}{43 \times 34} = 2.93$$





According to this study, the odds of contracting Hodgkin's disease are increased by 
about a factor of three by undergoing a tonsillectomy. 

As well as having a point estimate $\hat\Delta = 2.93$, it would be useful to attach
an approximate standard error to the estimate to indicate its uncertainty. Since
$\hat\Delta$ is a nonlinear function of the counts, it appears that an analytical
derivation of its standard error would be difficult. Once again, however, the
convenience of simulation (the bootstrap) comes to our aid. In order to approximate the
distribution of $\hat\Delta$ by simulation, we need to generate random numbers according
to a statistical model for the counts in the table of Vianna, Greenwald, and Davies.
The model is that the count in the first row and first column, $N_{11}$, is binomially
distributed with n = 101 and probability $\pi_{11}$. The count in the second row and
second column, $N_{22}$, is independently binomially distributed with n = 107 and
probability $\pi_{22}$. The distribution of the random variable

$$\hat\Delta = \frac{N_{11}N_{22}}{(101 - N_{11})(107 - N_{22})}$$

is thus determined by the two binomial distributions, and we could approximate it
arbitrarily well by drawing a large number of samples from them.

Since the probabilities $\pi_{11}$ and $\pi_{22}$ are unknown, they are estimated from
the observed counts by $\hat\pi_{11} = 67/101 = .663$ and $\hat\pi_{22} = 64/107 = .598$.
One thousand realizations of binomial random variables $N_{11}$ and $N_{22}$ were
generated on a computer and Figure 13.1 shows a histogram of the resulting 1000 values
of $\hat\Delta$. The standard deviation of these 1000 values was .89, which can be used
as an estimated standard error for our observed estimate $\hat\Delta = 2.93$.
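
The simulation just described can be sketched in a few lines of Python. This is an illustration under the stated binomial model, not the program used to produce the figure; the function names, the seed, and the rule of redrawing a sample whose odds ratio is undefined are assumptions made here. Because the random numbers differ, the resulting standard error will be close to, but not exactly, the value reported in the text.

```python
import random
import statistics

def odds_ratio(n11, n22, m1, m2):
    """Odds ratio N11*N22 / ((m1 - N11)(m2 - N22)) for binomial sizes m1, m2."""
    return (n11 * n22) / ((m1 - n11) * (m2 - n22))

def binomial_draw(rng, n, p):
    """One binomial(n, p) draw as a sum of n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))

def bootstrap_values(n11=67, n22=64, m1=101, m2=107, reps=1000, seed=0):
    rng = random.Random(seed)
    p11, p22 = n11 / m1, n22 / m2      # estimated cell probabilities
    values = []
    while len(values) < reps:
        b11 = binomial_draw(rng, m1, p11)
        b22 = binomial_draw(rng, m2, p22)
        if b11 == m1 or b22 == m2:     # denominator would be zero; redraw
            continue
        values.append(odds_ratio(b11, b22, m1, m2))
    return values

observed = odds_ratio(67, 64, 101, 107)
se = statistics.stdev(bootstrap_values())
print(f"estimate = {observed:.2f}, bootstrap standard error = {se:.2f}")
```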











FIGURE 13.1 Histogram of 1000 bootstrapped estimates of the odds ratio, $\hat\Delta$.
13.7 Concluding Remarks


This chapter has introduced two-way classifications, which are the simplest form 
of contingency tables. For higher-order classifications, which frequently occur in 
practice, a greater variety of forms of dependence arise. For example, for a three-way 
table, the factors of which are denoted A, B, and C, we might consider testing whether, 
conditionally on C, A and B are independent. 

Dependencies can be specified by means of log linear models. If the row and
column classifications in a two-way table are independent, then

$$\pi_{ij} = \pi_{i\cdot}\pi_{\cdot j}$$

or

$$\log \pi_{ij} = \log \pi_{i\cdot} + \log \pi_{\cdot j}$$

We can denote $\log \pi_{i\cdot}$ by $\alpha_i$ and $\log \pi_{\cdot j}$ by $\beta_j$. Then,
if there is dependence, $\log \pi_{ij}$ may be written as

$$\log \pi_{ij} = \alpha_i + \beta_j + \gamma_{ij}$$


This form mimics the additive analysis of variance models introduced in Chapter 12.
The idea can readily be extended to higher-order tables. For example, a possible
model for a three-way table is

$$\log \pi_{ijk} = \alpha_i + \beta_j + \gamma_k + \delta_{ij} + \epsilon_{ik} + \eta_{jk}$$

which allows second-order dependencies, but no third-order dependencies. The
parameters of log-linear models may be estimated by mle's and likelihood ratio tests
may be employed. Agresti (1996) treats these and other topics in the analysis of
categorical data.
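
The two-way decomposition above can be illustrated numerically. The Python sketch below takes $\alpha_i = \log \pi_{i\cdot}$ and $\beta_j = \log \pi_{\cdot j}$, as in the text, and computes the remaining term $\gamma_{ij} = \log \pi_{ij} - \alpha_i - \beta_j$. Both probability tables are hypothetical, and the function name is illustrative. Under independence every $\gamma_{ij}$ is zero.

```python
import math

def interaction_terms(pi):
    """gamma_ij = log pi_ij - log pi_i. - log pi_.j for a two-way table."""
    row = [sum(r) for r in pi]            # marginals pi_i.
    col = [sum(c) for c in zip(*pi)]      # marginals pi_.j
    return [[math.log(pi[i][j]) - math.log(row[i]) - math.log(col[j])
             for j in range(len(col))]
            for i in range(len(row))]

# Hypothetical independent table: products of marginals (0.3, 0.7) and (0.4, 0.6).
independent = [[0.12, 0.18],
               [0.28, 0.42]]
# Hypothetical dependent table with the mass concentrated on the diagonal.
dependent = [[0.30, 0.10],
             [0.10, 0.50]]

gamma_ind = interaction_terms(independent)
gamma_dep = interaction_terms(dependent)
print(gamma_ind)
print(gamma_dep)
```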


13.8 Problems 


1. Adult-onset diabetes is known to be highly genetically determined. A study 
was done comparing frequencies of a particular allele in a sample of such 
diabetics and a sample of nondiabetics. The data are shown in the following
table:

             Diabetic   Normal
Bb or bb        12         4
BB              39        49


Are the relative frequencies of the alleles significantly different in the two groups? 


2. Phillips and Smith (1990) conducted a study to investigate whether people could 
briefly postpone their deaths until after the occurrence of a significant occasion. 
The senior woman of the household plays a central ceremonial role in the Chinese 
Harvest Moon Festival. Phillips and Smith compared the mortality patterns of old
Jewish women and old Chinese women who died of natural causes for the weeks 
immediately preceding and following the festival, using records from California 
for the years 1960-1984. Compare the mortality patterns shown in the table. 
(Week -1 is the week preceding the festival, week 1 is the week following,
etc.) 








Week   Chinese   Jewish
 -2       55       141
 -1       33       145
  1       70       139
  2       49       161





3. Overfield and Klauber (1980) published the following data on the incidence of 
tuberculosis in relation to blood groups in a sample of Eskimos. Is there any 
association of the disease and blood group within the ABO system or within the 
MN system? 


ABO system

Severity             O    A    AB   B
Moderate-Advanced    7    5    3    13
Minimal              27   32   8    18
Not Present          55   50   7    24

MN system

Severity             MM   MN   NN
Moderate-Advanced    21   6    1
Minimal              54   27   5
Not Present          74   51   11











4. In a famous sociological study called Middletown, Lynd and Lynd (1956)
administered questionnaires to 784 white high school students. The students were
asked which two of ten given attributes were most desirable in their fathers. The
following table shows how the desirability of the attribute "being a college
graduate" was rated by male and female students. Did the males and females value
this attribute differently?

                 Male   Female
Mentioned         86      55
Not Mentioned    283     360


5. Dowdall (1974) [also discussed in Haberman (1978)] studied the effect of ethnic
background on role attitude of women of ages 15-64 in Rhode Island. Respondents
were asked whether they thought it was all right for a woman to have a job
instead of taking care of the house and children while her husband worked. The 
following table breaks down the responses by ethnic origin of the respondent. Is 
there a relationship between response and ethnic group? If so, describe it. 


























Ethnic Origin Yes No 
Italian 78 47 
Northern European 56 29 
Other European 43 29 
English 53 32 
Irish 43 30 
French Canadian 36 22 
French 42 23 
Portuguese 29 7 








6. It is conventional wisdom in military squadrons that pilots tend to father more 
girls than boys. Snyder (1961) gathered data for military fighter pilots. The 
sex of the pilots’ offspring were tabulated for three kinds of flight duty during 
the month of conception, as shown in the following table. Is there any significant
difference between the three groups? In the United States in 1950, 105.37
males were born for every 100 females. Are the data consistent with this sex
ratio?

Father's Activity    Female Offspring   Male Offspring
Flying Fighters             51                38
Flying Transports           14                16
Not Flying                  38                46





7. Grades in an elementary statistics class were classified by the students' majors. 
Is there any relationship between grade and major? 
                      Major
Grade    Psychology   Biology   Other
A             8          15       13
B            14          19       15
C            15           4        7
D-F           3           1        4








8. A randomized double-blind experiment compared the effectiveness of several 
drugs in ameliorating postoperative nausea. All patients were anesthetized with 
nitrous oxide and ether. The following table shows the incidence of nausea during 
the first four postoperative hours for each of several drugs and a placebo (Beecher 
1959). Compare the drugs to each other and to the placebo. 


























Number of Patients Incidence of Nausea 
Placebo 165 95 
Chlorpromazine 152 52 
Dimenhydrinate 85 52 
Pentobarbital (100 mg) 67 35 
Pentobarbital (150 mg) 85 37 


9. This problem considers some more data on Jane Austen and her imitator (Morton
1978). The following table gives the relative frequency of the word "a" preceded by
(PB) and not preceded by (NPB) the word "such", the word "and" followed by (FB)
or not followed by (NFB) "I", and the word "the" preceded by and not preceded by "on".























Words         Sense and Sensibility   Emma   Sanditon I   Sanditon II
a PB such              14              16        8             2
a NPB such            133             180       93            81
and FB I               12              14       12             1
and NFB I             241             285      139           153
the PB on              11               6        8            17
the NPB on            259             265      221           204














Was Austen consistent in these habits of style from one work to another? Did her
imitator successfully copy this aspect of her style?

10. Verify that the mle's of the cell probabilities, $\hat\pi_{ij}$, are as given in
Section 13.4 for the test of independence and in Section 13.5 for McNemar's test.

11. (a) Derive the likelihood ratio test of homogeneity. (b) Calculate the likelihood
ratio test statistic for the example of Section 13.3, and compare it to Pearson's
chi-square statistic. (c) Derive the likelihood ratio test of independence.
(d) Calculate the likelihood ratio test statistic for the example of Section 13.4,
and compare it to Pearson's chi-square statistic.

12. Show that McNemar's test is nearly equivalent to scoring each response as a 0
or a 1 and calculating a paired-sample t test on the resulting data.

13. A sociologist is studying influences on family size. He finds pairs of sisters,
both of whom are married, and determines for each sister whether she has 0, 1, or 2
or more children. He wants to compare older and younger sisters. Explain what
the following hypotheses mean and how to test them.

a. The number of children the younger sister has is independent of the number
of children the older sister has.

b. The distribution of family sizes is the same for older and younger sisters.

Could one hypothesis be true and the other false? Explain.

14. Lazarsfeld, Berelson, and Gaudet (1948) present the following tables relating
degree of interest in political elections to education and age:

No high school education

Degree of Interest   Under 45   Over 45
Great                   71        217
Little                 305        652

Some high school or more

Degree of Interest   Under 45   Over 45
Great                  305        180
Little                 869        259

Since there are three factors (education, age, and interest), these tables considered
jointly are more complicated than the tables considered in this chapter.

a. Examine the tables informally and analyze the dependence of interest in
political elections on age and education. What do the numbers suggest?

b. Extend the ideas of this chapter to test two hypotheses, $H_1$: given educational
level, age and degree of interest are unrelated, and $H_2$: given age, educational
level and degree of interest are unrelated.

15. Reread Section 11.4.5, which contains a discussion of methodological problems
in a study of the effects of FD&C Red No. 40. The following tables give the
numbers of mice that developed RE tumors in each of several groups:

Incidence among males

                     Control 1   Control 2   Low Dose   Med. Dose   High Dose
Number with tumor       25          10          20          9          17
Total number           100         100          99        100          99

Incidence among females

                     Control 1   Control 2   Low Dose   Med. Dose   High Dose
Number with tumor       33          25          32         26          22
Total number           100          99          99         99         100

















Use chi-square tests to compare the incidences in the different groups for males
and for females. Which differences are significant? What would you conclude
from this analysis had you not known about the possibility of cage position
effects?

16. A market research team conducted a survey to investigate the relationship of
personality to attitude toward small cars. A sample of 250 adults in a metropolitan
area were asked to fill out a 16-item self-perception questionnaire, on the basis
of which they were classified into three types: cautious conservative,
middle-of-the-roader, and confident explorer. They were then asked to give their
overall opinion of small cars: favorable, neutral, or unfavorable. Is there a
relationship between personality type and attitude toward small cars? If so, what is
the nature of the relationship?

                      Personality Type
Attitude       Cautious   Midroad   Explorer
Favorable         79         58        49
Neutral           10          8         9
Unfavorable       10         34        42

17. In a study of the relation of blood type to various diseases, the following data
were gathered in London and Manchester (Woolf 1955):

London

           Control   Peptic Ulcer
Group A     4219         579
Group O     4578         911
Manchester

           Control   Peptic Ulcer
Group A     3775         246
Group O     4532         361

First, consider the two tables separately. Is there a relationship between blood
type and propensity to peptic ulcer? If so, evaluate the strength of the relationship.
Are the data from London and Manchester comparable?

18. Records of 317 patients at least 48 years old who were diagnosed as having
endometrial carcinoma were obtained from two hospitals (Smith et al. 1975).
Matched controls for each case were obtained from the two institutions; the
controls had cervical cancer, ovarian cancer, or carcinoma of the vulva. Each control
was matched by age at diagnosis (within four years) and year of diagnosis (within
two years) to a corresponding case of endometrial carcinoma. This sort of design,
called a retrospective case-control study, is frequently used in medical
investigations where a randomized experiment is not possible. The following table
gives the numbers of cases and controls who had taken estrogen for at least 6
mo prior to the diagnosis of cancer. Is there a significant relationship between
estrogen use and endometrial cancer? Do you see any possible weak points in a
retrospective case-control design?

                           Controls
Cases             Estrogen Used   Not Used
Estrogen Used          39           113
Not Used               15           150

19. A psychological experiment was done to investigate the effect of anxiety on a
person's desire to be alone or in company (Schacter 1959; Lehmann 1975). A group
of 30 subjects was randomly divided into two groups of sizes 13 and 17. The
subjects were told that they would be subjected to some electric shocks, but one group
was told that the shocks would be quite painful and the other group was told that
they would be mild and painless. The former group was the "high-anxiety" group,
and the latter was the "low-anxiety" group. Both groups were told that there would
be a 10-min wait before the experiment began, and each subject was given the
choice of waiting alone or with the other subjects. The following are the results:

                Wait Together   Wait Alone
High-Anxiety         12             5
Low-Anxiety           4             9


Use Fisher's exact test to test whether there is a significant difference between
the high- and low-anxiety groups.

20. Define appropriate notation for the three sample designs considered in
Section 13.6 (simple random sample, prospective study, and retrospective study).

a. Show how to estimate the odds ratio $\Delta$.
b. Use the method of propagation of error to find approximately
$\mathrm{Var}(\log(\hat\Delta))$ ($\log(\hat\Delta)$ is sometimes used in place of
$\hat\Delta$).

21. For Problem 1, what is the relevant odds ratio and what is its estimate?

22. A study was done to identify factors affecting physicians' decisions to advise or
not to advise patients to stop smoking (Cummings et al. 1987). The study was 
related to a training program to teach physicians ways to counsel patients to stop 
smoking and was carried out in a family practice outpatient center in Buffalo, 
New York. The study population consisted of the cigarette-smoking patients of 
residents in family medicine seen in the center between February and May 1984. 


a. We first consider whether certain patient characteristics are related to being 
advised or not being advised. The following table shows a breakdown by sex: 





| Advised Not Advised 
Male 48 47 
Female 80 136 


What proportion of the males were advised to quit and what proportion of 
the females were advised? What are the standard errors of these proportions? 
What is the standard error of their difference? Test whether the difference in 
the proportions is statistically significant. 

Next consider a breakdown by race: White and Other versus African- 
American: 





| Advised Not Advised 
White 26 34 
African-American 102 149 


What proportions of the African-Americans and Whites were asked to quit and 
what are the standard errors of these proportions? What is the standard error 
of the difference of the proportions? Is the difference statistically significant? 

Finally consider the relation of the number of cigarettes smoked daily to 
being advised or not: 





          Advised   Not Advised
< 15         64        112
15-25        39         54
> 25         25         16
For each of the three groups, what proportion was advised to quit smoking and
what are the standard errors of the proportions? Is the difference in proportions 
statistically significant? 

b. Next consider the relationship of certain physician characteristics to the deci- 
sion whether to advise. First, the physician's sex: 


| Advised Not Advised 





Male 78 94 
Female 50 89 


What proportions of the patients of male and female physicians were advised? 
What are the standard errors of the proportions and their difference? Is the 
difference statistically significant? 

The following table shows the breakdown according to whether the physi- 
cian smokes: 


| Advised Not Advised 





Smoker 13 37 
Nonsmoker 115 146 


Of those patients who saw a smoking physician, what proportion were advised 
to quit, and of those who saw a nonsmoker, what proportion were so advised? 
What are the standard errors of the proportions and of their difference? Is the 
difference statistically significant? 

Finally, this table gives a breakdown by age of physician: 





Advised Not Advised 
< 30 88 128 
30-39 28 37 
> 39 12 18 


What are the proportions advised to quit in each of the three age categories and 
what are their standard errors? Are the differences statistically significant? 


23. Does heavy exercise increase the risk of myocardial infarction? Mittleman et al. 
(1993) studied this question by examining the activities of 1228 patients who 
had suffered myocardial infarctions. It was determined whether each patient had 
participated in heavy exertion in the hour before the onset of the infarction and 
also whether each had participated in heavy exertion at the same time the previous 
day. Their results are displayed in the following table:

                        Day of Infarction
Previous Day     Exertion   No Exertion   Total
Exertion             4           9          13
No Exertion         50        1165        1215
Total               54        1174        1228























Does the study demonstrate that heavy exertion is associated with myocar- 
dial infarction? How does the design of this study relate to that of the cell phone 
study in Example B of Section 13.5? 


24. Is it advantageous to wear the color red in a sporting contest? According to Hill 
and Barton (2005): 


Although other colours are also present in animal displays, it is specifi- 
cally the presence and intensity of red coloration that correlates with male 
dominance and testosterone levels. In humans, anger is associated with 
a reddening of the skin due to increased blood flow, whereas fear is as- 
sociated with increased pallor in similarly threatening situations. Hence, 
increased redness during aggressive interactions may reflect relative dom- 
inance. Because artificial stimuli can exploit innate responses to natural 
stimuli, we tested whether wearing red might influence the outcome of 
physical contests in humans. 

In the 2004 Olympic Games, contestants in four combat sports (box- 
ing, tae kwon do, Greco-Roman wrestling, and freestyle wrestling) were 
randomly assigned red or blue outfits (or body protectors). If colour has no 
effect on the outcome of contests, the number of winners wearing red should 
be statistically indistinguishable from the number of winners wearing blue. 


They thus tabulated the colors worn by the winners in these contests: 




















Sport                    Red   Blue
Boxing                   148   120
Freestyle Wrestling       27    24
Greco-Roman Wrestling     25    23
Tae Kwon Do               45    35

Some supplementary information is given in the file red-blue.txt.

a. Let $\pi_R$ denote the probability that the contestant wearing red wins. Test the
null hypothesis that $\pi_R = 1/2$ versus the alternative hypothesis that $\pi_R$ is the
same in each sport, but $\pi_R \neq 1/2$.

b. Test the null hypothesis $\pi_R = 1/2$ against the alternative hypothesis that allows
$\pi_R$ to be different in different sports, but not equal to 1/2.

c. Are either of these hypothesis tests equivalent to that which would test the
null hypothesis $\pi_R = 1/2$ versus the alternative hypothesis $\pi_R \neq 1/2$, using as
data the total numbers of wins summed over all the sports?

d. Is there any evidence that wearing red is more favorable in some of the sports
than others?

e. From an analysis of the points scored by winners and losers, Hill and Barton
concluded that color had the greatest effect in close contests. Data on the
points of each match are contained in the file red-blue.xls. Analyze these
data and see whether you agree with their conclusion.


25. The Physicians’ Health Study was a randomized, double-blind, placebo- 
controlled trial designed to determine whether low-dose aspirin (325 mg every 
other day) decreases cardiovascular mortality. The experiment assigned 11,037 
physicians at random to receive aspirin, and 11,034 to receive a placebo. 


a. The following table shows the incidence of cardiovascular events. What would 
you conclude about the effects of aspirin? 





                          Aspirin   Placebo
Myocardial Infarction
  Fatal                      10        26
  Nonfatal                  129       213
Stroke
  Fatal                       9         6
  Nonfatal                  110        92

















b. The following table details cardiovascular mortality. What would you
conclude about the effects of aspirin?








Cause Aspirin | Placebo 
Acute myocardial infarction 10 28 
Other ischemic heart disease 24 25 
Sudden death 22 12 
Stroke 10 7 
Other cardiovascular 15 11 

















26. Insulin pumps are used by diabetic patients to control blood glucose levels, but a 
side effect, diabetic ketoacidosis (DKA), may occur. Mecklenburg et al. (1984) 
gathered data on incidence of DKA before and after pump therapy, shown in the 
following table. Test whether the rate of DKA is the same before and after therapy. 





                   Before Therapy
After Therapy     No DKA    DKA
No DKA              128       7
DKA                  19       7

















27. The data in the following table are taken from an article in the New York Times
(April 20, 2001), "Victim's Race Affects Killer's Sentence." The data are from a
study of all homicide cases in North Carolina for the period 1993-1997 in which
it was possible that a murder conviction would result in the death penalty. Such
data have played an important role in the debate about the death penalty in the
U.S., the only wealthy western nation which imposes it. Qualitatively, what do
you conclude from looking at the numbers? Discuss whether it is appropriate
to use a chi-square test to test that the combination of the victim's race and the
defendant's race was independent of whether the defendant received the death
penalty for convicted murderers in North Carolina during the years 1993-1997.

Defendant's Race   Victim's Race   Death Penalty   No Death Penalty
Not white          White                33               251
White              White                33               508
Not white          Not white            29               587
White              Not white             4                76

28. In Section 13.3, a chi-square test of homogeneity was carried out on the
frequencies of word counts in four works. The test used the actual counts (e.g., 147
occurrences of the word "a" in Sense and Sensibility). Suppose that instead of
the counts, the relative frequencies (e.g., 147/375 = 0.39) were presented in the
table and the chi-square statistic was calculated using the relative frequencies
rather than the counts. Would the value of the chi-square statistic be the same?
What would happen if percentages were used?

29. Suppose that a company wishes to examine the relationship of gender to job
satisfaction, grouping job satisfaction into four categories: very satisfied, somewhat
satisfied, somewhat dissatisfied, and very dissatisfied. The company plans to ask
the opinions of 100 employees. Should you, the company's statistician, carry out
a chi-square test of independence or a test of homogeneity?
CHAPTER 14


Linear Least Squares 


Introduction 


In order to fit a straight line to a plot of points $(x_i, y_i)$, where $i = 1, \ldots, n$,
the slope and intercept of the line $y = \beta_0 + \beta_1 x$ must be found from the data
in some manner.
In order to fit a pth-order polynomial, p + 1 coefficients must be determined. Other 
functional forms besides linear and polynomial ones may be fit to data, and in order 
to do so parameters associated with those forms must be determined. 

The most common, but by no means only, method for determining the parameters 
in curve-fitting problems is the method of least squares. The principle underlying this 
method is to minimize the sum of squared deviations of the predicted, or fitted, 
values (given by the curve) from the actual observations. For example, suppose that 
a straight line is to be fit to the points $(y_i, x_i)$, where $i = 1, \ldots, n$; y is called the
dependent variable and x is called the independent variable, and we want to predict 
y from x. (This usage of the terms independent and dependent is different from their 
probabilistic meaning.) Sometimes x and y are called the predictor variable and the 
response variable, respectively. Applying the method of least squares, we choose 
the slope and intercept of the straight line to minimize 


S(fo, Bi) = 3 Oi — Bo — Bii 
i-l 

Note that 69 and £, are chosen to minimize the sum of squared vertical deviations, or 
prediction errors (see Figure 14.1). The procedure is not symmetric in y and x. 

Curves are often fit to data as part of the process of calibrating instruments. 
For example, Bailey, Cox, and Springer (1978) discuss a method for measuring the 
concentrations of food dyes and other substances by high-pressure chromatography. 
Measurements of the chromatographic peak areas corresponding to sulfanilic acid 
were taken for several known concentrations of FD&C Yellow No. 5. Figure 14.2 
FIGURE 14.1 The least squares line minimizes the sum of squared vertical
distances (dotted lines) from the points to the line.

FIGURE 14.2 Data points and the least squares line for the relation of sulfanilic
acid peak area to percentage of FD&C Yellow.


shows a plot of peak area versus percentage of FD&C Yellow. To casual examination, 
the plot looks fairly linear. 

Once the equation of the line was established, it could be used in estimating 
concentrations of the dye from measurements of peak area. 
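To find $\hat\beta_0$ and $\hat\beta_1$, we calculate

$$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)$$

$$\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i)$$

Setting these partial derivatives equal to zero, we have that the minimizers $\hat\beta_0$ and $\hat\beta_1$
satisfy

$$\sum_{i=1}^{n} y_i = n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i$$

$$\sum_{i=1}^{n} x_i y_i = \hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2$$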


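Solving for $\hat\beta_0$ and $\hat\beta_1$, we obtain

$$\hat\beta_0 = \frac{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i y_i\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$\hat\beta_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

Problem 10 at the end of the chapter asks you to derive the following useful equivalent
expressions:

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$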
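As a quick numerical illustration, the equivalent expressions above can be coded directly. The following is a minimal sketch in Python; the function name `fit_line` and the five data points are hypothetical and are not the chromatography data discussed in the text.

```python
def fit_line(x, y):
    """Return (b0, b1) minimizing sum((y_i - b0 - b1 * x_i) ** 2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope in centered form.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical points scattered about the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
b0, b1 = fit_line(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

Because the estimates satisfy the first of the two equations obtained by setting the partial derivatives to zero, the residuals from the fitted line sum to zero, which is a convenient sanity check on any implementation.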


The fitted line, with the parameters determined from the expressions above to be
$\hat\beta_0 = .073$ and $\hat\beta_1 = 10.8$, is drawn in Figure 14.2. Is this a “reasonable” fit? How
much faith do we have in these values for $\hat\beta_0$ and $\hat\beta_1$, since there is apparently some
"noise" in the data? We will answer these questions in later sections of this chapter.

Functional forms more complicated than straight lines are often fit to data. For 
example, to determine the proper placement in college mathematics courses for enter- 
ing freshmen, data that have a bearing on predicting performance in first-year calculus 
may be available. Suppose that score on a placement exam, high school grade-point 
average in math courses, and quantitative college board scores are available; we can 
denote these values by $x_1$, $x_2$, and $x_3$, respectively. We might try to predict a student's
grade in first-year calculus, $y$, by the form

$$y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

where the $\beta_i$ could be estimated from data on the performance of students in previous
years. It could be ascertained how reliable this prediction equation was, and if it were 
sufficiently reliable, it could be used in a counseling program for entering freshmen. 

In biological and chemical work, it is common to fit functions of the following 
form to decay curves: 


$$f(t) = Ae^{-\alpha t} + Be^{-\beta t}$$

Note that the function $f$ is linear in the parameters $A$ and $B$ and nonlinear in the
parameters $\alpha$ and $\beta$. From data $(y_i, t_i)$, $i = 1, \ldots, n$, where, for example, $y_i$ is the
measured concentration of a substance at time $t_i$, the parameters are determined by
the method of least squares as being the minimizers of

$$S(A, B, \alpha, \beta) = \sum_{i=1}^{n} \left(y_i - Ae^{-\alpha t_i} - Be^{-\beta t_i}\right)^2$$


In fitting periodic phenomena, functions of the following form occur: 


$$f(t) = A\cos\omega_1 t + B\sin\omega_1 t + C\cos\omega_2 t + D\sin\omega_2 t$$

This function is linear in the parameters $A$, $B$, $C$, and $D$ and nonlinear in the parameters
$\omega_1$ and $\omega_2$.

When the function to be fit is linear in the unknown parameters, the minimization 
is relatively straightforward, because calculating partial derivatives and setting them 
equal to zero produces a set of simultaneous linear equations that can be solved in 
closed form. This important special case is known as linear least squares. If the 
function to be fit is not linear in the unknown parameters, a system of nonlinear 
equations must be solved to find the coefficients. Typically, the solution cannot be 
found in closed form, so an iterative procedure must be used. 

For our purposes, the general formulation of the linear least squares problem is 
as follows: A function of the form 


$$f(x_1, x_2, \ldots, x_{p-1}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{p-1} x_{p-1}$$

involving $p$ unknown parameters, $\beta_0, \beta_1, \beta_2, \ldots, \beta_{p-1}$, is to be fit to $n$ data points,

$$\begin{array}{l}
y_1, x_{11}, x_{12}, \ldots, x_{1,p-1} \\
y_2, x_{21}, x_{22}, \ldots, x_{2,p-1} \\
\quad\vdots \\
y_n, x_{n1}, x_{n2}, \ldots, x_{n,p-1}
\end{array}$$

The function $f(x)$ is called the linear regression of $y$ on $x$. We will always assume
that $p < n$, that is, that there are fewer unknown parameters than observations.
Fitting a straight line clearly follows this format. A quadratic can be fit in this way by
setting $x_1 = x$ and $x_2 = x^2$. If the frequencies in the trigonometric fitting problem
referred to above are known, we can let $x_1 = \cos\omega_1 t$, $x_2 = \sin\omega_1 t$, $x_3 = \cos\omega_2 t$,
and $x_4 = \sin\omega_2 t$ and identify the unknown amplitudes $A$, $B$, $C$, and $D$ as the $\beta_i$. If
the frequencies are unknown and must be determined from the data, the problem is
nonlinear.
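To make the point about the quadratic concrete, the following sketch fits a quadratic as a linear least squares problem by treating $x$ and $x^2$ as two separate predictors. It assumes NumPy is available; the data are hypothetical and were generated exactly from $y = 1 + 2x + 3x^2$, so the fit should recover those coefficients.

```python
import numpy as np

# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# One row per observation; columns are the constant, x_1 = x, and x_2 = x^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Least squares solution; the model is linear in the three coefficients
# even though it is nonlinear in x.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```

The same pattern applies to the trigonometric example when the frequencies are known: only the columns of the matrix change.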
Many functions that are not initially linear in the unknowns can be put into linear
form by means of a suitable transformation. An example of this type of function that
occurs frequently in chemistry and biochemistry is the Arrhenius equation,

$$\alpha = Ce^{-e_A/(KT)}$$

Here, $\alpha$ is the rate of a chemical reaction, $C$ is an unknown constant called the
frequency factor, $e_A$ is the activation energy of the reaction, $K$ is Boltzmann's constant,
and $T$ is absolute temperature. If a reaction is run at several temperatures and the rate
is measured, the activation energy and the frequency factor can be estimated by fitting
the equation to the data. The function as written above is linear in the parameter $C$
and nonlinear in $e_A$, but

$$\log\alpha = \log C - e_A \frac{1}{KT}$$

is linear in $\log C$ and $e_A$. As an example, a plot of log rate versus $1/T$ for a reaction
involving atomic oxygen, taken from Huie and Herron (1972), is shown in Figure 14.3.
Figure 14.4 is a plot of rate versus temperature, which, in contrast, is quite nonlinear.

FIGURE 14.3 A plot of log rate versus 1/T for a reaction involving atomic oxygen.

FIGURE 14.4 A plot of rate versus temperature for a reaction involving atomic
oxygen.
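The linearization of the Arrhenius equation can be sketched numerically as follows. This is an illustration under stated assumptions, not the analysis of the Huie and Herron data: the temperatures, the value of $\log C$, and the ratio $e_A/K$ are hypothetical, the rates are generated without noise, and the regression is on $1/T$ so that the fitted slope estimates $-e_A/K$.

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical parameter values, used only to generate illustrative data.
LOG_C = 23.0        # log of the frequency factor C
EA_OVER_K = 5000.0  # activation energy divided by Boltzmann's constant

temps = [300.0, 350.0, 400.0, 450.0, 500.0]  # absolute temperatures
rates = [math.exp(LOG_C - EA_OVER_K / t) for t in temps]

# Transform: log(rate) is a linear function of 1/T.
inv_t = [1.0 / t for t in temps]
log_rate = [math.log(a) for a in rates]

intercept, slope = fit_line(inv_t, log_rate)
log_c_hat = intercept      # estimate of log C
ea_over_k_hat = -slope     # estimate of e_A / K
print(log_c_hat, ea_over_k_hat)
```

With noisy measurements the recovered values would only approximate the generating ones, and the least squares criterion would be applied on the log scale rather than on the original scale of the rates.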


14.2 Simple Linear Regression


This section deals with the very common problem of fitting a straight line to data. 
Later sections of this chapter will generalize the results of this section. First, statistical 
properties of least squares estimates will be discussed and then methods of assessing 
goodness of fit, largely through the examination of residuals. Finally, the relation of 
regression to correlation is presented. 


14.2.1 Statistical Properties of the Estimated Slope and Intercept


Up to now we have presented the method of least squares simply as a reasonable 
principle, without any explicit discussion of statistical models. Consequently, we have 
not addressed such pertinent questions as the reliability of the slope and intercept in 
the presence of “noise.” In order to address this question, we must have a statistical 
model for the noise. The simplest model, which we will refer to as the standard 
statistical model, stipulates that the observed value of y is a linear function of x plus 
random noise: 


$$y_i = \beta_0 + \beta_1 x_i + e_i, \qquad i = 1, \ldots, n$$

Here the $e_i$ are independent random variables with $E(e_i) = 0$ and $\mathrm{Var}(e_i) = \sigma^2$. The
$x_i$ are assumed to be fixed.

In Section 14.1, we derived formulas for the slope, $\hat\beta_1$, and the intercept, $\hat\beta_0$.
Referring to those equations, we see that they are linear functions of the $y_i$, and
thus linear functions of the $e_i$. $\hat\beta_0$ and $\hat\beta_1$ are estimates of $\beta_0$ and $\beta_1$. The standard
statistical model thus makes computation of the means and variances of $\hat\beta_0$ and $\hat\beta_1$
straightforward.
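THEOREM A

Under the assumptions of the standard statistical model, the least squares estimates
are unbiased: $E(\hat\beta_j) = \beta_j$, for $j = 0, 1$.

Proof

From the assumptions, $E(y_i) = \beta_0 + \beta_1 x_i$. Thus, from the equation for $\hat\beta_0$ in
Section 14.1,

$$E(\hat\beta_0) = \frac{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} E(y_i)\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i E(y_i)\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$= \frac{\left(\sum_{i=1}^{n} x_i^2\right)\left(n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$= \beta_0$$

The proof for $\hat\beta_1$ is similar. ■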


Note that the proof of Theorem A does not depend on the assumptions that the $e_i$
are independent and have the same variance, only on the assumptions that the errors
are additive and $E(e_i) = 0$.

From the standard statistical model, $\mathrm{Var}(y_i) = \sigma^2$ and $\mathrm{Cov}(y_i, y_j) = 0$, where
$i \neq j$. This makes the computation of the variances of the $\hat\beta_i$ straightforward.


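THEOREM B

Under the assumptions of the standard statistical model,

$$\mathrm{Var}(\hat\beta_0) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$\mathrm{Var}(\hat\beta_1) = \frac{n\sigma^2}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \frac{-\sigma^2 \sum_{i=1}^{n} x_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$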
Proof

From a form for $\hat\beta_1$ given in Section 14.1,

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

The identity for the numerator follows from expanding the product and using
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. We then have

$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

which reduces to the desired expression. The other expressions may be derived
similarly. Later we will give a more general proof. ■
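The two steps that the proof leaves to the reader can be written out as follows. This is a worked sketch; the weights $c_i$ are notation introduced here for convenience and do not appear in the text.

```latex
% Step 1: variance of a linear combination of uncorrelated y_i.
% Write \hat\beta_1 = \sum_i c_i y_i with
%   c_i = (x_i - \bar{x}) / \sum_j (x_j - \bar{x})^2 .
\mathrm{Var}(\hat\beta_1)
  = \sigma^2 \sum_{i=1}^{n} c_i^2
  = \frac{\sigma^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}
         {\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^2}
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

% Step 2: rewrite the denominator in terms of raw sums.
\sum_{i=1}^{n} (x_i - \bar{x})^2
  = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
  = \frac{1}{n}\left[\, n \sum_{i=1}^{n} x_i^2
      - \left(\sum_{i=1}^{n} x_i\right)^2 \right]

% Combining the two steps:
\mathrm{Var}(\hat\beta_1)
  = \frac{n\sigma^2}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
```

Step 1 uses only $\mathrm{Var}(y_i) = \sigma^2$ and $\mathrm{Cov}(y_i, y_j) = 0$ for $i \neq j$, which is where the independence and constant variance assumptions enter.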


From Theorem B, we see that the variances of the slope and intercept depend
on the $x_i$ and on the error variance, $\sigma^2$. The $x_i$ are known; therefore, to estimate
the variance of the slope and intercept, we need to estimate only $\sigma^2$. Since, in the
standard statistical model, $\sigma^2$ is the expected squared deviation of the $y_i$ from the
line $\beta_0 + \beta_1 x$, it is natural to base an estimate of $\sigma^2$ on the average squared deviations
of the data about the fitted line. We define the residual sum of squares (RSS)
to be

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$$

We will show in Section 14.4.3 that

$$s^2 = \frac{\mathrm{RSS}}{n - 2}$$

is an unbiased estimate of $\sigma^2$. The divisor $n - 2$ is used rather than $n$ because two
parameters have been estimated from the data, giving $n - 2$ degrees of freedom.
The variances of $\hat\beta_0$ and $\hat\beta_1$ as given in Theorem B are thus estimated by replacing
$\sigma^2$ by $s^2$, yielding estimates that we will denote $s^2_{\hat\beta_0}$ and $s^2_{\hat\beta_1}$.
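The chain from residuals to standard errors can be sketched in a few lines of Python. The function name `simple_regression`, the dictionary keys, and the data are hypothetical; the formulas are those of Theorem B with $\sigma^2$ replaced by $s^2$, written using the identity $n\sum x_i^2 - (\sum x_i)^2 = n\sum (x_i - \bar{x})^2$.

```python
import math


def simple_regression(x, y):
    """Fit y = b0 + b1 * x and return estimates with standard errors."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 points for n - 2 degrees of freedom")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares and the unbiased estimate of sigma^2.
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    # Estimated standard errors: Theorem B with sigma^2 replaced by s^2.
    se_b1 = math.sqrt(s2 / sxx)
    se_b0 = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))
    return {"b0": b0, "b1": b1, "rss": rss, "s2": s2,
            "se_b0": se_b0, "se_b1": se_b1}


# Hypothetical points scattered about the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
fit = simple_regression(x, y)
for name, value in fit.items():
    print(f"{name} = {value:.4f}")
```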


If the errors, $e_i$, are independent normal random variables, then the estimated
slope and intercept, being linear combinations of independent normally distributed
random variables, are normally distributed as well. More generally, if the $e_i$ are
independent and the $x_i$ satisfy certain assumptions, a version of the central limit
theorem implies that, for large $n$, the estimated slope and intercept are approximately
normally distributed. The normality assumption, or its approximation, makes possible
the construction of confidence intervals and hypothesis tests. It can then be shown
that

$$\frac{\hat\beta_i - \beta_i}{s_{\hat\beta_i}} \sim t_{n-2}$$

which allows the $t$ distribution to be used for confidence intervals and hypothesis
tests.


EXAMPLE A  We apply these procedures to the 21 data points on chromatographic peak area. The
following table presents some of the statistics from the fit (tables like this are produced
by regression programs of software packages):







Coefficient    Estimate    Standard Error    t Value
$\beta_0$      .0729       .0297             2.45
$\beta_1$      10.77       .27               40.20





The estimated standard deviation of the errors is $s = .068$. The standard error of the
intercept is $s_{\hat\beta_0} = .0297$. A 95% confidence interval for the intercept, $\beta_0$, based on
the $t$ distribution with 19 df is

$$\hat\beta_0 \pm t_{19}(.025)\, s_{\hat\beta_0}$$

or (.011, .135). Similarly, a 95% confidence interval for the slope, $\beta_1$, is

$$\hat\beta_1 \pm t_{19}(.025)\, s_{\hat\beta_1}$$

or (10.21, 11.33). To test the null hypothesis $H_0$: $\beta_0 = 0$, we would use the $t$ statistic
$\hat\beta_0 / s_{\hat\beta_0} = 2.45$. The hypothesis would be rejected at significance level $\alpha = .05$, so
there is strong evidence that the intercept is nonzero. ■
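The interval arithmetic in this example can be reproduced from the tabulated estimates and standard errors. The sketch below is illustrative only: the helper name `t_interval` is hypothetical, and the critical value 2.093 for the $t$ distribution with 19 degrees of freedom is taken from standard tables rather than computed.

```python
# Upper 2.5% point of the t distribution with 19 df, from standard tables.
T19_025 = 2.093


def t_interval(estimate, std_error, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


# Estimates and standard errors as reported in the table of this example.
lo0, hi0 = t_interval(0.0729, 0.0297, T19_025)
lo1, hi1 = t_interval(10.77, 0.27, T19_025)

# t statistic for testing that the intercept is zero.
t_intercept = 0.0729 / 0.0297

print(f"intercept: ({lo0:.3f}, {hi0:.3f})")
print(f"slope:     ({lo1:.2f}, {hi1:.2f})")
print(f"t statistic for the intercept: {t_intercept:.2f}")
```

Because the tabulated standard error of the slope is rounded to two decimals, the slope interval computed this way can differ from the one quoted in the text in the last digit.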


14.2.2 Assessing the Fit


As an aid in assessing the quality of the fit, we will make extensive use of the residuals, 
which are the differences between the observed and fitted values: 


$$\hat{e}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$$

It is most useful to examine the residuals graphically. Plots of the residuals versus the
$x$ values may reveal systematic misfit or ways in which the data do not conform to
the fitted model. Ideally, the residuals should show no relation to the $x$ values, and
the plot should look like a horizontal blur.


EXAMPLE A  Figure 14.5 is a plot of the residuals for the data on chromatographic peak area.
There is no apparent deviation from randomness in the residuals, so this plot confirms
the impression from Figure 14.2 that it is reasonable to model the relation as
linear. ■

FIGURE 14.5 A plot of residuals for the data on chromatographic peak area.


We next consider an example in which the plot shows curvature. 


EXAMPLE B  The data in the following table were gathered for an environmental impact study that
examined the relationship between the depth of a stream and the rate of its flow (Ryan,
Joiner, and Ryan 1976).

Depth    Flow Rate
.34      .636
.29      .319
.28      .734
.42      1.327
.29      .487
.41      .924
.76      7.350
.73      5.890
.46      1.979
.40      1.124





A plot of flow rate versus depth suggests that the relation is not linear (Figure 14.6).
This is even more immediately apparent from the bowed shape of the plot
of the residuals versus depth (Figure 14.7). In order to empirically linearize relationships,
transformations are frequently employed. Figure 14.8 is a plot of log rate
versus log depth, and Figure 14.9 shows the residuals for the corresponding fit. There
is no sign of obvious misfit. (The possibility of expressing flow rate as a quadratic
function of depth will be explored in a later example.) ■

FIGURE 14.6 A plot of flow rate versus stream depth.

FIGURE 14.7 Residuals from the regression of flow rate on depth.

FIGURE 14.8 Plot of log flow rate versus log depth.

FIGURE 14.9 Residuals from the regression of log flow rate on log depth.
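The comparison of the two fits in this example can be sketched as follows. The depth and flow-rate values are transcribed from the table in the example; the helper names `fit_line` and `residuals` are hypothetical, and natural logarithms are assumed for the transformation. The code prints the residuals ordered by depth so that any bowed pattern can be inspected by eye; it does not draw the figures.

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(x, y, b0, b1):
    """Observed minus fitted values."""
    return [yi - b0 - b1 * xi for xi, yi in zip(x, y)]


# Stream data as transcribed from the table in this example.
depth = [0.34, 0.29, 0.28, 0.42, 0.29, 0.41, 0.76, 0.73, 0.46, 0.40]
flow = [0.636, 0.319, 0.734, 1.327, 0.487, 0.924, 7.350, 5.890, 1.979, 1.124]

# Fit on the original scale.
a0, a1 = fit_line(depth, flow)
raw_resid = residuals(depth, flow, a0, a1)

# Fit after taking logs of both variables.
log_depth = [math.log(d) for d in depth]
log_flow = [math.log(f) for f in flow]
c0, c1 = fit_line(log_depth, log_flow)
log_resid = residuals(log_depth, log_flow, c0, c1)

print("depth   raw residual   log-scale residual")
for d, r, lr in sorted(zip(depth, raw_resid, log_resid)):
    print(f"{d:5.2f}   {r:12.3f}   {lr:18.3f}")
```

A straight line on the log-log scale corresponds to a power-law relation between flow rate and depth on the original scale, with the fitted slope `c1` as the exponent.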
We have seen that one of the assumptions of the standard statistical model is
that the variance of the errors is constant and does not depend on $x$. Errors with this
property are said to be homoscedastic. If the variance of the errors is not constant,
the errors are said to be heteroscedastic. If in fact the error variance is not constant,
standard errors and confidence intervals based on the assumption that $s^2$ is an estimate
of $\sigma^2$ may be misleading.


EXAMPLE C  In Problem 65 at the end of Chapter 7, data on the population and number of breast
cancer mortalities in 301 counties were presented. A scatterplot of the number of cases
($y$) versus population ($x$) is shown in Figure 14.10. This plot appears to be consistent
with the simple model that the number of cases is proportional to the population size,
or $y \approx \beta x$. (We will test whether or not the intercept is zero below.) Accordingly, we
fit a model with zero intercept by least squares to the data, yielding $\hat\beta = 3.559 \times 10^{-3}$.
(See Problem 15 at the end of this chapter for fitting a zero intercept model.) Figure
14.11 shows the residuals from the regression of the number of cases on population
plotted versus population. Since it is very hard to see what is going on in the left-hand
side of this plot, the residuals are plotted versus log population in Figure 14.12, from
which it is quite clear that the error variance is not constant but grows with population
size.
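For a model with zero intercept, minimizing $\sum (y_i - \beta x_i)^2$ over $\beta$ gives $\hat\beta = \sum x_i y_i / \sum x_i^2$, which follows from setting the single derivative to zero. The sketch below codes that estimate and, as a second illustration, the same fit applied after a square root transformation of both variables. The function name and the four (population, count) pairs are hypothetical; they are not the county data.

```python
import math


def through_origin_slope(x, y):
    """Least squares slope for the zero-intercept model y = beta * x."""
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("all x values are zero; the slope is undefined")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


# Hypothetical (population, number of cases) pairs.
population = [1000.0, 5000.0, 20000.0, 80000.0]
cases = [4.0, 18.0, 85.0, 310.0]

# Fit on the original scale.
beta_hat = through_origin_slope(population, cases)

# Fit on the square root scale, then square the slope to get back to a rate.
gamma_hat = through_origin_slope([math.sqrt(p) for p in population],
                                 [math.sqrt(c) for c in cases])
beta_tilde = gamma_hat ** 2

print(f"original scale:    {beta_hat:.6f}")
print(f"square root scale: {beta_tilde:.6f}")
```

The two estimates generally differ because the fits weight the observations differently: on the original scale the largest populations dominate the sums, whereas on the square root scale their influence is reduced.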

The residual plot in Figure 14.12 shows no curvature but indicates that the variance
is not constant. For counted data, the variability often grows with the mean,
and frequently a square root transformation is used in an attempt to stabilize the
variance. We therefore fit a model of the form $\sqrt{y} \approx \gamma\sqrt{x}$. Figure 14.13 shows
the plot of residuals for this fit. The residual variability is more nearly constant
here; $\beta$ is estimated by the square of the slope, $\hat\gamma^2$, which for this example gives
$\tilde\beta = \hat\gamma^2 = 3.471 \times 10^{-3}$.

FIGURE 14.10 Scatterplot showing breast cancer mortality versus population.

FIGURE 14.11 Residuals from the regression of mortality on population.

FIGURE 14.12 A plot of residuals versus log population.

Finally, we note that the zero intercept model can be tested in the following way.
A linear regression on a square root scale is calculated with both slope and intercept
terms, and the intercept is found to be $\hat\gamma_0 = .066$ with a standard error $s_{\hat\gamma_0} = 9.74 \times 10^{-2}$.
The $t$ statistic for testing $H_0$: $\gamma_0 = 0$ is

$$t = \frac{\hat\gamma_0}{s_{\hat\gamma_0}} = .68$$

The null hypothesis cannot be rejected for these data. ■

FIGURE 14.13 Residuals from the regression of the square root of mortality on the
square root of population.


A normal probability plot of residuals may be useful in indicating gross departures 
from normality and the presence of outliers. Least squares estimates are not robust 
against outliers, which can have a large effect on the estimated coefficients, their 
standard errors, and s, especially if the corresponding x values are at the extremes of 
the data. It can happen, however, that an outlier with an extreme x value will pull the 
line toward itself and produce a small residual, as illustrated in Figure 14.14. 


EXAMPLE D  Figures 14.15 and 14.16 are normal probability plots of the residuals from the fits of
Example C. For Figure 14.15, the residuals are from the ordinary linear regression
with zero intercept; for Figure 14.16, the residuals are from the zero intercept model
with the square root transformation. Note that the distribution in Figure 14.16 is more
nearly normal (although there is a hint of skewness) and that the distribution in Figure
14.15 is heavier-tailed than the normal distribution because of the presence of the large
residuals from the heavily populated counties. ■

FIGURE 14.14 An extreme x value exerts great leverage on the fitted line and
produces a small residual at that point.

FIGURE 14.15 Normal probability plot of the residuals from the regression of
mortality on population.


It is often useful to plot residuals against variables that are not in the model but 
might be influential. If the data were collected over a period of time, a plot of the 
residuals versus time might reveal unexpected time dependencies. 

We conclude this section with an extended example. 
FIGURE 14.16 Normal probability plot of the residuals from the regression of the
square root of mortality on the square root of population.


EXAMPLE E  Houck (1970) studied the bismuth I-II transition pressure as a function of temperature.
The data are listed in the following table. (Residuals have been rescaled to
have standard deviation equal to 1; this process is discussed further in Section
14.4.4.)

Pressure (bar)    Temperature (°C)    Standardized Residual
25366             20.8                1.67
25356             20.9                1.48
25336             21.0                .97
25256             21.9                .40
25267             22.1                .22
25306             22.1                1.46
25237             22.4                −.35
25267             22.5                .74
25138             24.8                −.34
25148             24.8                −.02
25143             25.0                .08
24731             34.0                −1.20
24751             34.0                −.57
24771             34.1                .19
24424             42.7                .46
24444             42.7                1.11
24419             42.7                .30
24117             49.0                .45
24102             50.1                −.08
24092             50.1                −.42
25202             22.5                −1.33
25157             23.1                −1.97
25157             23.0                −2.10





From Figure 14.17, a plot of the tabulated data, it appears that the relationship is
fairly linear. The least squares line is

Pressure = 26172(±21) − 41.3(±.6) × temperature

where the estimated standard errors of the parameters are given in parentheses. The
residual standard deviation is $s = 32.5$ with 21 df. An approximate 95% confidence
interval for the slope is

$$\hat\beta_1 \pm s_{\hat\beta_1}\, t_{21}(.025)$$

or (−42.55, −40.05).

FIGURE 14.17 A plot of bismuth I-II transition pressure versus temperature.


In order to check how well the model fits, we look at a plot of the standardized
residuals versus temperature (Figure 14.18). The plot is rather odd. At first
glance, it appears that the variability is greater at lower temperatures. (Bear in mind
that the error variance was assumed to be constant in the derivation of the statistical
properties of $\hat\beta$.) There is another possible explanation for the wedge-shaped
appearance of the residual plot. The table reveals that the data were apparently collected
in the following way: three measurements at about 21°C, five at about 22°C,

FIGURE 14.18 A plot of standardized residuals versus temperature.
three at about 25°C, three at about 34°C, three at about 43°C, three at about 50°C, 
and three at about 23°C. It is quite possible that the measurements were taken in 
the order in which they are listed. We can circle these groups of measurements on 
the residual plot and note the offsets among them. The last three measurements, 
taken at about 23°C, particularly stand out, and the three taken at 43°C appear out 
of line with those at 34°C and 50°C. A plausible explanation for this pattern is as 
follows: The experimental equipment was set up for a given temperature and sev- 
eral measurements were made; then the equipment was set for another temperature 
and more measurements were made, and so on; at each setting, errors were intro- 
duced that affected every measurement at that temperature. Calibration errors are a 
possibility. 

The standard statistical model, which assumes that the errors at each point are
independent, does not provide a faithful representation of such a phenomenon. The
standard errors given above for $\hat\beta_0$ and $\hat\beta_1$ and the confidence interval for $\beta_1$ are clearly
suspect. (Recall, however, that the estimates $\hat\beta_0$ and $\hat\beta_1$ are unbiased even if the errors
are dependent.) ■


14.2.3 Correlation and Regression


There is a close relationship between correlation analysis and fitting straight lines by
the least squares method. Let us introduce some notation:

$$s_{xx} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

$$s_{yy} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

The correlation coefficient between the $x$'s and $y$'s is

$$r = \frac{s_{xy}}{\sqrt{s_{xx} s_{yy}}}$$

The slope of the least squares line is (see Problem 10 at the end of this chapter)

$$\hat\beta_1 = \frac{s_{xy}}{s_{xx}}$$

and therefore

$$r = \hat\beta_1 \sqrt{\frac{s_{xx}}{s_{yy}}}$$

In particular, the correlation is zero if and only if the slope is zero.
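The relation between the correlation coefficient and the slope is easy to confirm numerically. The following sketch uses hypothetical data and a hypothetical helper name; it computes the three averaged sums of products, the correlation coefficient, and the slope, and then checks that the correlation equals the slope times the square root of the ratio of $s_{xx}$ to $s_{yy}$.

```python
import math


def moments(x, y):
    """Return (sxx, syy, sxy), each an average of centered products."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x) / n
    syy = sum((yi - y_bar) ** 2 for yi in y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    return sxx, syy, sxy


# Hypothetical data.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

sxx, syy, sxy = moments(x, y)
r = sxy / math.sqrt(sxx * syy)         # correlation coefficient
slope = sxy / sxx                      # least squares slope
r_from_slope = slope * math.sqrt(sxx / syy)

print(f"r = {r:.5f}, slope = {slope:.5f}, slope-based r = {r_from_slope:.5f}")
```

The factor $1/n$ cancels in both ratios, so using $n - 1$ in place of $n$ in all three sums would leave $r$ and the slope unchanged.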

To further investigate the relationship of correlation and regression, it is instructive
to standardize the variables. If in the regression equation $\hat{y} = \hat\beta_0 + \hat\beta_1 x$ the
coefficients are expressed as

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and $\hat\beta_1$ is expressed in terms of $r$ as above, then after some manipulation, we arrive at

$$\frac{\hat{y} - \bar{y}}{\sqrt{s_{yy}}} = r\, \frac{x - \bar{x}}{\sqrt{s_{xx}}}$$

(You should check this calculation.) The equation can be interpreted as follows:
Suppose that $r > 0$ and that $x$, the predictor variable, is one standard deviation
greater than its average; then the predicted value of $y$ is $r$ standard deviations bigger
than its average, where $r < 1$ unless the points lie exactly on a line. The predicted
value thus deviates from its average by fewer
standard deviations than does the predictor. In units of standard deviations, it is closer
to its average than is the predictor.

The term regression stems from the work of Sir Francis Galton (1822-1911), a 
famous geneticist who studied the sizes of seeds and their offspring and the heights 
of fathers and their sons. In both cases, he found that the offspring of parents of larger 
than average size tended to be smaller than their parents and that the offspring of 
parents of smaller than average size tended to be larger than their parents. He called 
this phenomenon “regression towards mediocrity.” This is exactly what the regression 
line predicts, as in the previous paragraph. 
EXAMPLE A  Figure 14.19 (from Freedman, Pisani, and Purves, 1998) is a scatterplot of the heights
of 1078 pairs of fathers and sons. The fathers’ average height is 67.7 in. with a standard
deviation of 2.74 in.; the sons' average and standard deviation are 68.7 in. and 2.81
in., respectively; the correlation coefficient is 0.501. The solid line in the figure is
the regression line, and the dashed one is the line $y = x + 1$ (since the sons are 1 in.
taller than the fathers on average). Notice how the prediction son's height = father's
height + 1 under-predicts on the left and over-predicts on the right.

FIGURE 14.19 A scatterplot of the heights of 1078 sons versus the heights of their
fathers.

In the vertical strip on the right, the fathers’ heights are 72 in. to the nearest
inch; the average height of the sons in that strip is 71 in., one inch shorter than their
fathers'. The regression line is

$$\frac{\hat{y} - 68.7}{2.81} = 0.501 \times \frac{x - 67.7}{2.74}$$

Evaluating this for $x = 72$ predicts the sons’ height to be 70.9 in., which is very close
to the empirical average in the strip.

In the vertical strip on the left, the fathers’ heights are 64 in. to the nearest inch,
and the average height of a son in that strip is 67 in., three inches taller than their
fathers’. The prediction from the regression line is 66.8 in. ■
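The two predictions quoted in this example follow directly from the standardized form of the regression line. The sketch below simply evaluates that line using the summary statistics given in the example; the function name `predict_son` is hypothetical.

```python
# Summary statistics as given in the example (heights in inches).
FATHER_MEAN, FATHER_SD = 67.7, 2.74
SON_MEAN, SON_SD = 68.7, 2.81
R = 0.501


def predict_son(father_height):
    """Predicted son's height from the regression line in standardized form."""
    z_father = (father_height - FATHER_MEAN) / FATHER_SD
    return SON_MEAN + R * SON_SD * z_father


pred_tall = predict_son(72.0)
pred_short = predict_son(64.0)
print(f"father 72 in.: predicted son {pred_tall:.1f} in.")
print(f"father 64 in.: predicted son {pred_short:.1f} in.")
```

A father 72 in. tall is about 1.57 standard deviations above the fathers' average, while the predicted son is only about 0.79 standard deviations above the sons' average, which is the regression effect expressed in standard units.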


EXAMPLE B  Statistics from the sport of baseball have been extensively gathered and studied;
the statistical analysis of baseball records is called “sabermetrics.” (See Albert and
Bennett, 2003.) Analysis has shown that one of the key statistics relating to a player’s
offensive effectiveness is the percentage of the time he gets on base. The left panel of
Figure 14.20 shows the on-base percentage for all American League players in 2001
and 2002 with at least 100 plate appearances each season. There is a strong correlation
($r = 0.62$) between the players' performances in the two consecutive seasons. The
right panel of the figure shows the difference (2002 − 2001) plotted against the 2001
performance. Observe that the scatterplot exhibits a negative slope: players who did
relatively poorly in 2001 tended to improve in 2002, whereas those who did relatively
well in 2001 tended to worsen in 2002. ■

FIGURE 14.20 The left panel is a scatterplot of the on-base-percentage of 148 American
League players in 2002 versus their percentages in 2001. In the right panel, the change is plotted
versus on-base-percentage in 2001.


We have already encountered the phenomenon of regression in Example B in
Section 4.4.1, where we saw that if $X$ and $Y$ follow a bivariate normal distribution
with $\sigma_X = \sigma_Y = 1$, the conditional expectation of $Y$ given $X$ does not lie along
the major axis of the elliptical contours of the joint density; rather, $E(Y|X) = \rho X$.
Regression to the mean was also discussed in Example B of Section 4.4.2.

The regression effect must be taken into account in test-retest situations. Suppose, 
for example, that a group of preschool children are given an IQ test at age four and 
another test at age five. The results of the tests will certainly be correlated, and 
according to the analysis above, children who do poorly on the first test will tend to 
score higher on the second test. If, on the basis of the first test, low-scoring children 
were selected for supplemental educational assistance, their gains might be mistakenly 
attributed to the program. A comparable control group is needed in this situation to 
tighten up the experimental design. 
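The test-retest argument can be illustrated with a small simulation. Everything in the sketch below is a hypothetical model chosen for illustration: each child has a fixed underlying ability, each test score is that ability plus independent measurement noise, no intervention of any kind is applied, and the selection cutoff of 85 is arbitrary.

```python
import random

rng = random.Random(0)  # fixed seed so that the run is reproducible
n = 20000

# Hypothetical model: score = ability + independent noise on each occasion.
ability = [rng.gauss(100.0, 12.0) for _ in range(n)]
test1 = [a + rng.gauss(0.0, 9.0) for a in ability]
test2 = [a + rng.gauss(0.0, 9.0) for a in ability]

# Select the children who scored low on the first test.
low = [i for i in range(n) if test1[i] < 85.0]

mean1 = sum(test1[i] for i in low) / len(low)
mean2 = sum(test2[i] for i in low) / len(low)
gain = mean2 - mean1

print(f"selected {len(low)} of {n} children")
print(f"mean first score:  {mean1:.1f}")
print(f"mean second score: {mean2:.1f}")
print(f"apparent gain with no intervention: {gain:.1f}")
```

The selected group improves on average even though nothing was done to it, which is why a gain observed in a selected group cannot be credited to a program without a comparable control group.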
14.3 The Matrix Approach to Linear Least Squares


With problems more complex than fitting a straight line, it is very useful to approach
the linear least squares analysis via linear algebra. As well as providing
a compact notation, the conceptual framework of linear algebra can generate theoretical
and practical insights. Developments in numerical analysis have resulted
in the availability of high-quality software packages; see, for example, LINPACK
(www.netlib.org/linpack/).

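Suppose that a model of the form

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}$$

is to be fit to data, which we denote as

$$y_i, x_{i1}, x_{i2}, \ldots, x_{i,p-1}, \qquad i = 1, \ldots, n$$

The observations $y_i$, where $i = 1, \ldots, n$, will be represented by a vector $Y$. The
unknowns, $\beta_0, \ldots, \beta_{p-1}$, will be represented by a vector $\beta$. Let $X_{n \times p}$ be the matrix

$$X = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\
1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1}
\end{bmatrix}$$

For a given $\beta$, the vector of fitted or predicted values, $\hat{Y}$, can be written

$$\underset{n \times 1}{\hat{Y}} = \underset{n \times p}{X}\;\underset{p \times 1}{\beta}$$

(Verify this by writing out explicitly the first row of the system of equations.) The
least squares problem can then be phrased as follows: Find $\beta$ to minimize

$$S(\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_{p-1} x_{i,p-1})^2 = \|Y - X\beta\|^2 = \|Y - \hat{Y}\|^2$$

(If $u$ is a vector, $\|u\|^2 = \sum_{i=1}^{n} u_i^2$.)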


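EXAMPLE A  Let us consider fitting a straight line, $y = \beta_0 + \beta_1 x$, to points $(y_i, x_i)$, where
$i = 1, \ldots, n$. In this case,

$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

and

$$Y - X\beta = \begin{bmatrix}
y_1 - \beta_0 - \beta_1 x_1 \\
y_2 - \beta_0 - \beta_1 x_2 \\
\vdots \\
y_n - \beta_0 - \beta_1 x_n
\end{bmatrix}$$

■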

Returning to the general case, if we differentiate $S$ with respect to each $\beta_k$ and
set the derivatives equal to zero, we see that the minimizers $\hat\beta_0, \ldots, \hat\beta_{p-1}$ satisfy the
$p$ linear equations

$$n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_{i1} + \cdots + \hat\beta_{p-1} \sum_{i=1}^{n} x_{i,p-1} = \sum_{i=1}^{n} y_i$$

$$\hat\beta_0 \sum_{i=1}^{n} x_{ik} + \hat\beta_1 \sum_{i=1}^{n} x_{i1} x_{ik} + \cdots + \hat\beta_{p-1} \sum_{i=1}^{n} x_{i,p-1} x_{ik} = \sum_{i=1}^{n} y_i x_{ik}, \qquad k = 1, \ldots, p - 1$$

These $p$ equations can be written in matrix form

$$X^T X \hat\beta = X^T Y$$

and are called the normal equations. If $X^T X$ is nonsingular, the formal solution is

$$\hat\beta = (X^T X)^{-1} X^T Y$$

We stress that this is a formal solution; computationally, it is sometimes unwise
even to form the normal equations because the multiplications involved in forming
$X^T X$ can introduce undesirable round-off error. Alternative methods of finding
the least squares solution $\hat\beta$ are developed in Problems 8 and 9 at the end of this
chapter.
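The distinction between the formal solution and its computation can be sketched as follows, assuming NumPy is available. The data are hypothetical. Three routes to the same estimate are shown: solving the normal equations as a linear system (without forming an explicit inverse), a QR factorization of $X$, and NumPy's general least squares routine. The QR route is one common alternative that avoids forming $X^T X$; it is offered here as an illustration and is not claimed to be the method developed in the problems.

```python
import numpy as np

# Hypothetical data scattered about the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])
X = np.column_stack([np.ones_like(x), x])  # design matrix with a constant column

# Route 1: normal equations, solved as a linear system.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: QR factorization, X = QR with Q having orthonormal columns,
# so the least squares solution satisfies R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Route 3: library least squares routine, which works on X directly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal, beta_qr, beta_lstsq)
```

For a small, well-conditioned problem like this one the three answers agree to rounding error. The concern raised in the text arises when the columns of $X$ are nearly linearly dependent, because forming $X^T X$ squares the condition number of the problem.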
The following lemma gives a criterion for the existence and uniqueness of solutions
of the normal equations.


LEMMA A

$X^T X$ is nonsingular if and only if the rank of $X$ equals $p$.


Proof

First suppose that $X^T X$ is singular. There exists a nonzero vector $u$ such that
$X^T X u = 0$. Multiplying the left-hand side of this equation by $u^T$, we have

$$0 = u^T X^T X u = (Xu)^T (Xu)$$

so $Xu = 0$, the columns of $X$ are linearly dependent, and the rank of $X$ is less
than $p$.

Next, suppose that the rank of $X$ is less than $p$ so that there exists a nonzero
vector $u$ such that $Xu = 0$. Then $X^T X u = 0$, and hence $X^T X$ is singular. ■


For example, suppose that a straight line is to be fitted to the points $(y_i, x_i)$,
$i = 1, 2, 3$. Then the design matrix is

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix}$$

If $x_1 = x_2 = x_3$, the two columns of $X$ are proportional to each other, so $X$ has
rank 1 and $X^T X$ is singular. In this case, we would be trying to fit a line to a single
point. You should calculate $X^T X$ and check that it is singular.
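The suggested check can also be carried out numerically. The sketch below assumes NumPy is available; the helper name `design_matrix` and the particular $x$ values are hypothetical. It builds the design matrix for three equal $x$ values and for three distinct ones, and compares the ranks and the determinant of $X^T X$.

```python
import numpy as np


def design_matrix(x):
    """Design matrix for a straight-line fit: a column of ones and a column of x."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x])


X_bad = design_matrix([2.0, 2.0, 2.0])  # all x values equal
X_ok = design_matrix([1.0, 2.0, 3.0])   # distinct x values

rank_bad = np.linalg.matrix_rank(X_bad)
rank_ok = np.linalg.matrix_rank(X_ok)

gram_bad = X_bad.T @ X_bad              # equals [[3, 6], [6, 12]]
det_bad = np.linalg.det(gram_bad)
det_ok = np.linalg.det(X_ok.T @ X_ok)

print(f"equal x:    rank {rank_bad}, det(X^T X) = {det_bad:.3g}")
print(f"distinct x: rank {rank_ok}, det(X^T X) = {det_ok:.3g}")
```

In the degenerate case the second column of the design matrix is twice the first, which is exactly the linear dependence described in the proof of the lemma.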
The vector B = (X^ X)! X"Y is the vector of fitted parameters, and the corre- 


sponding vector of fitted, or predicted, y values is Y- Xf. The residuals Y — Y — 
Y — Xf are the differences between the observed and fitted values. We will make use 


of these residuals in examining goodness of fit. 
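The suggested check can be carried out numerically. The following is a minimal sketch; the helper name `design_matrix` and the particular $x$ values are assumptions chosen for illustration. It builds the straight-line design matrix for three equal $x$ values and for three distinct ones, and compares ranks and determinants of $X^TX$.

```python
import numpy as np

def design_matrix(x):
    """Design matrix for a straight-line fit: a column of ones and a column of x."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x])

X_bad = design_matrix([2.0, 2.0, 2.0])    # x1 = x2 = x3: columns proportional
X_ok = design_matrix([1.0, 2.0, 3.0])     # distinct x values

rank_bad = np.linalg.matrix_rank(X_bad)   # rank less than p = 2
rank_ok = np.linalg.matrix_rank(X_ok)     # full column rank
det_bad = np.linalg.det(X_bad.T @ X_bad)  # X^T X is singular
det_ok = np.linalg.det(X_ok.T @ X_ok)     # X^T X is nonsingular
print(rank_bad, rank_ok, det_bad, det_ok)
```

In line with Lemma A, the rank-deficient design gives a singular $X^TX$, so the normal equations have no unique solution.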


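Returning to Example A on fitting a straight line, we have

$$X^TX = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}$$

$$(X^TX)^{-1} = \frac{1}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2} \begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix}$$

$$X^TY = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix}$$

Thus,

$$\hat\beta = (X^TX)^{-1}X^TY = \frac{1}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2} \begin{bmatrix} \left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i\right) - \left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n x_i y_i\right) \\ n\sum_{i=1}^n x_i y_i - \left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n y_i\right) \end{bmatrix}$$

which agrees with the earlier calculation. ■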


14.4 Statistical Properties of Least 
Squares Estimates 


In this section, we develop some statistical properties of the vector $\hat\beta$, which is found
by the least squares method, under some assumptions on the vector of errors. In order 
to do this, we must use concepts and notation for the analysis of random vectors. 


14.4.1 Vector-Valued Random Variables 


In Section 14.3, we found expressions for least squares estimates in terms of matrices 
and vectors. We now develop methods and notation for dealing with random vectors, 
vectors whose components are random variables. These concepts will be applied to 
finding statistical properties of the least squares estimates. 
We consider the random vector

$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}$$

the elements of which are jointly distributed random variables with
$$E(Y_i) = \mu_i$$
and
$$\operatorname{Cov}(Y_i, Y_j) = \sigma_{ij}$$
The mean vector is defined to be simply the vector of means, or

$$E(Y) = \mu_Y = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_n \end{bmatrix}$$

The covariance matrix of $Y$, denoted $\Sigma$, is defined to be an $n \times n$ matrix with the $ij$
element $\sigma_{ij}$, which is the covariance of $Y_i$ and $Y_j$. Note that $\Sigma$ is a symmetric matrix.
Suppose that

$$\underset{m \times 1}{Z} = \underset{m \times 1}{c} + \underset{m \times n}{A}\;\underset{n \times 1}{Y}$$


is another random vector formed from a fixed vector, c, and a fixed linear transforma- 
tion, A, of the random vector Y. The next two theorems show how the mean vector 
and covariance matrix of Z are determined from the mean vector and covariance 
matrix of Y and the matrix A. Each of the theorems is followed by two examples; the 
results in those examples could easily be derived without using matrix algebra, but 
they illustrate how the matrix formalisms work. 


THEOREM A


If $Z = c + AY$, where $Y$ is a random vector, $A$ is a fixed matrix, and $c$ is a
fixed vector, then

$$E(Z) = c + A\,E(Y)$$


Proof
The $i$th component of $Z$ is

$$Z_i = c_i + \sum_{j=1}^n a_{ij}Y_j$$

By the linearity of the expectation,

$$E(Z_i) = c_i + \sum_{j=1}^n a_{ij}E(Y_j)$$

Writing these equations in matrix form completes the proof. ■


EXAMPLE A
As a simple example, let us consider the case where $Z = \sum_{i=1}^n a_iY_i$. In matrix
notation, this can be written $Z = a^TY$. According to Theorem A,

$$E(Z) = a^T\mu = \sum_{i=1}^n a_i\mu_i$$

as we already knew. ■


EXAMPLE B
As another example, let us consider a moving average. Suppose that $Z_i = Y_i + Y_{i+1}$,
for $i = 1, \ldots, n-1$. We can write this in matrix notation as $Z = AY$, where $A$ is the
$(n-1) \times n$ matrix

$$A = \begin{bmatrix} 1 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 1 & 1 & 0 & \cdots & 0 & 0 \\ \vdots & & & & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & 1 \end{bmatrix}$$

Using Theorem A to find $E(Z)$, it is easy to see that $A\mu$ has $i$th component
$\mu_i + \mu_{i+1}$. ■


THEOREM B


Under the assumptions of Theorem A, if the covariance matrix of $Y$ is $\Sigma_{YY}$, then
the covariance matrix of $Z$ is

$$\Sigma_{ZZ} = A\Sigma_{YY}A^T$$


Proof
The constant $c$ does not affect the covariance.

$$\operatorname{Cov}(Z_i, Z_j) = \operatorname{Cov}\left(\sum_{k=1}^n a_{ik}Y_k, \sum_{l=1}^n a_{jl}Y_l\right) = \sum_{k=1}^n \sum_{l=1}^n a_{ik}a_{jl}\operatorname{Cov}(Y_k, Y_l) = \sum_{k=1}^n \sum_{l=1}^n a_{ik}\sigma_{kl}a_{jl}$$

The last expression is the $ij$ element of the desired matrix. ■


EXAMPLE C
Continuing Example A, suppose that the $Y_i$ are uncorrelated with constant variance
$\sigma^2$. The covariance matrix of $Y$ can then be expressed as $\Sigma_{YY} = \sigma^2 I$, where $I$ is the
identity matrix. The role of $A$ in Theorem B is played by $a^T$. Therefore, the covariance
matrix of $Z$, which is a $1 \times 1$ matrix in this case, is

$$\Sigma_{ZZ} = \sigma^2 a^Ta = \sigma^2 \sum_{i=1}^n a_i^2$$ ■


EXAMPLE D
Suppose that the $Y_i$ of Example B have the covariance matrix $\sigma^2 I$. Then
$\Sigma_{ZZ} = \sigma^2 AA^T$, or

$$\Sigma_{ZZ} = \sigma^2 \begin{bmatrix} 2 & 1 & 0 & 0 & \cdots & 0 \\ 1 & 2 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 2 & 1 \\ 0 & 0 & \cdots & 0 & 1 & 2 \end{bmatrix}$$ ■

The proofs of both these theorems are straightforward, although the unfamiliarity 
of the notation may present a difficulty. But one of the advantages of using matrices 
and vectors when dealing with collections of random variables is that this notation is 
much more compact and easier to follow once one has mastered it, because all the 
subscripts have been suppressed. 
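The matrix formalism of Theorems A and B is easy to check numerically for the moving-average example. The sketch below assumes illustrative values for $n$, $\sigma^2$, and the mean vector (none of which come from the text). It builds the $(n-1) \times n$ matrix $A$, then computes $A\mu$ and $A(\sigma^2 I)A^T$.

```python
import numpy as np

n = 6
sigma2 = 2.0
mu = np.arange(1.0, n + 1.0)            # mean vector (1, 2, ..., n), illustrative

# (n-1) x n matrix of the moving average: row i has ones in columns i and i+1.
A = np.zeros((n - 1, n))
idx = np.arange(n - 1)
A[idx, idx] = 1.0
A[idx, idx + 1] = 1.0

mean_Z = A @ mu                          # Theorem A with c = 0: E(Z) = A E(Y)
cov_Y = sigma2 * np.eye(n)               # uncorrelated Y_i with variance sigma^2
cov_Z = A @ cov_Y @ A.T                  # Theorem B: Sigma_ZZ = A Sigma_YY A^T
print(mean_Z)
print(cov_Z)
```

The printed covariance matrix is tridiagonal, with $2\sigma^2$ on the diagonal and $\sigma^2$ immediately above and below it: adjacent sums $Z_i$ and $Z_{i+1}$ share exactly one term, $Y_{i+1}$.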

Let $A$ be a symmetric $n \times n$ matrix and $x$ an $n$ vector. The expression

$$x^TAx = \sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} x_j$$

is called a quadratic form. We will next calculate the expectation of a quadratic form
in the case where $x$ is a random vector.


THEOREM C


Let $X$ be a random $n$ vector with mean $\mu$ and covariance $\Sigma$, and let $A$ be a fixed
matrix. Then

$$E(X^TAX) = \operatorname{trace}(A\Sigma) + \mu^TA\mu$$


Proof

The trace of a square matrix is defined to be the sum of its diagonal terms. Since
$$E(X_iX_j) = \sigma_{ij} + \mu_i\mu_j$$
we have that

$$E\left(\sum_{i=1}^n \sum_{j=1}^n X_iX_ja_{ij}\right) = \sum_{i=1}^n \sum_{j=1}^n \sigma_{ij}a_{ij} + \sum_{i=1}^n \sum_{j=1}^n \mu_i\mu_ja_{ij} = \operatorname{trace}(A\Sigma) + \mu^TA\mu$$ ■


EXAMPLE E
Consider $E\left[\sum_{i=1}^n (X_i - \bar X)^2\right]$, where the $X_i$ are uncorrelated random variables with
common mean $\mu$ and common variance $\sigma^2$. We recognize that this is the squared length of a vector $AX$ for
some matrix $A$. To figure out what $A$ must be, we first note that $\bar X$ can be expressed as

$$\bar X = \frac{1}{n}\mathbf{1}^TX$$

where $\mathbf{1}$ is a vector consisting of all ones. The vector consisting of entries all of which
are $\bar X$ can thus be written as $(1/n)\mathbf{1}\mathbf{1}^TX$, and $A$ can be written as

$$A = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$$

Thus,

$$\sum_{i=1}^n (X_i - \bar X)^2 = \|AX\|^2 = X^TA^TAX$$

The matrix $A$ has some special properties. In particular, $A$ is symmetric, and $A^2 = A$,
as can be verified by simply multiplying $A$ by $A$, noting that $\mathbf{1}^T\mathbf{1} = n$. Thus,

$$X^TA^TAX = X^TAX$$
and by Theorem C,
$$E(X^TAX) = \sigma^2\operatorname{trace}(A) + \mu^TA\mu$$

Since the mean vector can be written as $\mu\mathbf{1}$, it can be verified that $A\mu\mathbf{1} = 0$. Also,
$\operatorname{trace}(A) = n - 1$, so the expectation above is $\sigma^2(n - 1)$. ■
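The properties of the centering matrix used in Example E can be verified directly. This is a minimal sketch with assumed illustrative values of $n$, $\mu$, and $\sigma^2$: it checks symmetry, idempotence, and the trace of $A = I - (1/n)\mathbf{1}\mathbf{1}^T$, and then evaluates the right-hand side of Theorem C.

```python
import numpy as np

n = 5
mu, sigma2 = 3.0, 4.0                    # illustrative common mean and variance
ones = np.ones((n, 1))
A = np.eye(n) - ones @ ones.T / n        # centering matrix I - (1/n) 1 1^T

mean_vec = mu * ones                     # E(X) = mu * 1
Sigma = sigma2 * np.eye(n)               # uncorrelated with common variance

# Theorem C: E(X^T A X) = trace(A Sigma) + mu^T A mu
expected_ss = np.trace(A @ Sigma) + (mean_vec.T @ A @ mean_vec).item()

print(np.allclose(A, A.T), np.allclose(A @ A, A), np.trace(A), expected_ss)
```

With these values the expectation is $\sigma^2(n-1) = 16$, the familiar reason for dividing the sum of squared deviations by $n - 1$ to obtain an unbiased variance estimate.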
If $Y_{p \times 1}$ and $Z_{m \times 1}$ are random vectors, the cross-covariance matrix of $Y$ and $Z$
is defined to be the $p \times m$ matrix $\Sigma_{YZ}$ with the $ij$ element $\sigma_{ij} = \operatorname{Cov}(Y_i, Z_j)$.

The entries of the cross-covariance matrix quantify the strengths of linear
relationships between the elements of $Y$ and $Z$. The covariance of $Y_i$ and $Z_j$ can be
converted to a correlation coefficient by dividing by the product of the standard
deviations of $Y_i$ and $Z_j$.


THEOREM D


Let $X$ be a random vector with covariance matrix $\Sigma_{XX}$. If

$$Y = \underset{p \times n}{A}\,X$$
and
$$Z = \underset{m \times n}{B}\,X$$

where $A$ and $B$ are fixed matrices, the cross-covariance matrix of $Y$ and $Z$ is

$$\Sigma_{YZ} = A\Sigma_{XX}B^T$$


Proof

The proof follows the lines of that of Theorem B (you should work it through for
yourself). ■


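EXAMPLE F
Let $X$ be a random $n$ vector with $E(X) = \mu\mathbf{1}$ and $\Sigma_{XX} = \sigma^2 I$. Let $Y = \bar X$, and let $Z$
be the vector with $i$th element $X_i - \bar X$. We will find $\Sigma_{ZY}$, an $n \times 1$ matrix. In matrix
form,

$$Y = \frac{1}{n}\mathbf{1}^TX, \qquad Z = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)X$$

From Theorem D,

$$\Sigma_{ZY} = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)(\sigma^2 I)\left(\frac{1}{n}\mathbf{1}\right)$$

which becomes an $n \times 1$ matrix of zeros after multiplying out. Thus, the mean $\bar X$
is uncorrelated with each of $X_i - \bar X$, $i = 1, \ldots, n$. In the case that the elements of
$X$ are normal random variables, for which being uncorrelated implies independence,
this result implies Theorem A of Section 6.3 and hence that $\bar X$ and $S^2$ are independent
(Corollary A of Section 6.3). ■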


14.4.2 Mean and Covariance of Least Squares Estimates


Once a function has been fit to data by the least squares method, it may be necessary 
to consider the stability of the fit and of the estimated parameters, since if the mea- 
surements were to be taken again they would often be slightly different. To address 
the question of the variability of least squares estimates in the presence of noise, we 
will use the following model: 

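$$Y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_jx_{ij} + e_i, \qquad i = 1, \ldots, n$$

where the $e_i$ are random errors with
$$E(e_i) = 0$$
$$\operatorname{Var}(e_i) = \sigma^2$$
$$\operatorname{Cov}(e_i, e_j) = 0, \qquad i \neq j$$
In matrix notation, we have

$$\underset{n \times 1}{Y} = \underset{n \times p}{X}\;\underset{p \times 1}{\beta} + \underset{n \times 1}{e}$$

and
$$E(e) = 0$$
$$\Sigma_{ee} = \sigma^2 I$$

In words, the $y$ measurements are equal to the true values of the function plus random,
uncorrelated errors with constant variance. Note that in this model, the $X$'s are fixed,
not random. A useful theorem follows immediately from Theorem A of Section 14.4.1.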


THEOREM A 


Under the assumption that the errors have mean zero, the least squares estimates 
are unbiased. 


Proof 
The least squares estimate of $\beta$ is
$$\hat\beta = (X^TX)^{-1}X^TY = (X^TX)^{-1}X^T(X\beta + e) = \beta + (X^TX)^{-1}X^Te$$
From Theorem A of Section 14.4.1,
$$E(\hat\beta) = \beta + (X^TX)^{-1}X^TE(e) = \beta$$ ■


It should be noted that the only assumption on the errors used in this proof of
Theorem A is that they have mean zero. Thus, even if the errors are correlated and
have a nonconstant variance, the least squares estimates are unbiased. The covariance
matrix of $\hat\beta$ can also be calculated; the proof of the following theorem does depend
on assumptions concerning the covariance of the errors.


THEOREM B 


Under the assumption that the errors have mean zero and are uncorrelated with
constant variance $\sigma^2$, the covariance matrix of the least squares estimate $\hat\beta$ is

$$\Sigma_{\hat\beta\hat\beta} = \sigma^2(X^TX)^{-1}$$

Proof
From Theorem B of Section 14.4.1, the covariance matrix of $\hat\beta$ is

$$\Sigma_{\hat\beta\hat\beta} = (X^TX)^{-1}X^T\,\Sigma_{ee}\,X(X^TX)^{-1} = \sigma^2(X^TX)^{-1}$$

since the covariance matrix of $e$ is $\sigma^2 I$, and $X^TX$ and therefore $(X^TX)^{-1}$ as well
are symmetric. ■


These theorems generalize Theorems A and B of Section 14.2.1. Note how the 
use of matrix algebra simplifies the derivation. 
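Both theorems can be checked by simulation. The following sketch is illustrative only: the design, the "true" parameter values, the error standard deviation, the seed, and the number of replications are all assumptions. It holds $X$ fixed, redraws the errors many times, and compares the empirical mean and covariance of $\hat\beta$ with $\beta$ and $\sigma^2(X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 30, 0.5
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])     # fixed design for a straight line
beta = np.array([2.0, -1.0])             # "true" parameters (illustrative)

reps = 20000
E = rng.normal(scale=sigma, size=(reps, n))        # one row of errors per replication
Ys = X @ beta + E                                   # reps x n matrix of responses
XtX = X.T @ X
beta_hats = np.linalg.solve(XtX, X.T @ Ys.T).T      # reps x 2 matrix of estimates

emp_mean = beta_hats.mean(axis=0)                   # should be close to beta
emp_cov = np.cov(beta_hats, rowvar=False)           # should be close to theo_cov
theo_cov = sigma**2 * np.linalg.inv(XtX)
print(emp_mean)
print(emp_cov)
print(theo_cov)
```

The errors here happen to be normal, but only the mean-zero, uncorrelated, constant-variance assumptions are needed for the agreement.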


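We return to the case of fitting a straight line. From the computation of $(X^TX)^{-1}$ in
Example B in Section 14.3, we have

$$\Sigma_{\hat\beta\hat\beta} = \frac{\sigma^2}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2} \begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix}$$

Therefore,

$$\operatorname{Var}(\hat\beta_0) = \frac{\sigma^2\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$$

$$\operatorname{Var}(\hat\beta_1) = \frac{n\sigma^2}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$$

$$\operatorname{Cov}(\hat\beta_0, \hat\beta_1) = \frac{-\sigma^2\sum_{i=1}^n x_i}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$$ ■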
14.4.3 Estimation of $\sigma^2$


In order to use the formulas for variances developed in the preceding section (to form
confidence intervals, for example), $\sigma^2$ must be known or estimated. In this section,
we develop an estimate of $\sigma^2$.

Because $\sigma^2$ is the expected squared value of an error, $e_i$, it is natural to use the
sample average squared value of the residuals. The vector of residuals is

$$\hat e = Y - \hat Y = Y - X\hat\beta = Y - X(X^TX)^{-1}X^TY$$
or
$$\hat e = Y - PY$$

where $P = X(X^TX)^{-1}X^T$ is an $n \times n$ matrix.
Two useful properties of P are given in the following lemma (you should be able 
to write out its proof). 


LEMMA A
Let $P$ be defined as before. Then
$$P = P^T = P^2$$
$$(I - P) = (I - P)^T = (I - P)^2$$ ■


Since $P$ has the properties given in this lemma, it is a projection matrix; that is,
$P$ projects onto the subspace of $R^n$ spanned by the columns of $X$. Thus, we may think
geometrically of the fitted values, $\hat Y$, as being the projection of $Y$ onto the subspace
spanned by the columns of $X$. However, we will not pursue the implications of this
geometrical interpretation.
The sum of squared residuals is, using Lemma A,
$$\sum_{i=1}^n (Y_i - \hat Y_i)^2 = \|Y - PY\|^2 = \|(I - P)Y\|^2 = Y^T(I - P)^T(I - P)Y = Y^T(I - P)Y$$

From Theorem C of Section 14.4.1, we can compute the expected value of this
quadratic form:

$$E[Y^T(I - P)Y] = [E(Y)]^T(I - P)[E(Y)] + \sigma^2\operatorname{trace}(I - P)$$
Now $E(Y) = X\beta$, so
$$(I - P)E(Y) = [I - X(X^TX)^{-1}X^T]X\beta = 0$$
Furthermore,
$$\operatorname{trace}(I - P) = \operatorname{trace}(I) - \operatorname{trace}(P)$$

and, using the cyclic property of the trace, that is, $\operatorname{trace}(AB) = \operatorname{trace}(BA)$, we have

$$\operatorname{trace}(P) = \operatorname{trace}[X(X^TX)^{-1}X^T] = \operatorname{trace}[X^TX(X^TX)^{-1}] = \operatorname{trace}(I_{p \times p}) = p$$

Since $\operatorname{trace}(I_{n \times n}) = n$, we have shown that
$$E(\|Y - \hat Y\|^2) = (n - p)\sigma^2$$

and have proved the following theorem.


THEOREM A


Under the assumption that the errors are uncorrelated with constant variance $\sigma^2$,
an unbiased estimate of $\sigma^2$ is

$$s^2 = \frac{\|Y - \hat Y\|^2}{n - p}$$


The sum of the squared residuals, $\|Y - \hat Y\|^2$, is often denoted by RSS, for residual
sum of squares.
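The steps of this derivation can be traced numerically. The sketch below uses simulated data, with the design, coefficients, noise level, and seed all assumed for illustration. It forms the projection matrix $P$, confirms the properties in Lemma A and $\operatorname{trace}(P) = p$, and computes RSS and $s^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # illustrative design
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)    # P = X (X^T X)^{-1} X^T
Y_hat = P @ Y                            # fitted values as a projection of Y
resid = Y - Y_hat                        # residual vector (I - P) Y
rss = resid @ resid                      # residual sum of squares
s2 = rss / (n - p)                       # unbiased estimate of sigma^2

print(np.allclose(P, P.T), np.allclose(P @ P, P), np.trace(P), rss, s2)
```

Forming the $n \times n$ matrix $P$ explicitly is convenient for checking the algebra; for large $n$ one would normally compute fitted values as $X\hat\beta$ instead.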


14.4.4 Residuals and Standardized Residuals


Information concerning whether or not a model fits is contained in the vector of
residuals,

$$\hat e = Y - \hat Y = (I - P)Y$$

As we did for the case of fitting a straight line, we will use the residuals to check on the
adequacy of the fit of a presumed functional form and on the assumptions underlying
the statistical analysis (such as that the errors are uncorrelated with constant variance).
The covariance matrix of the residuals is

$$\Sigma_{\hat e\hat e} = (I - P)(\sigma^2 I)(I - P)^T = \sigma^2(I - P)$$

where we have used Lemma A of Section 14.4.3. We see that the residuals are
correlated with one another and that different residuals have different variances. In order
to make the residuals comparable to one another, they are often standardized. Also,
standardization puts the residuals on the familiar scale corresponding to a normal
distribution with mean 0 and variance 1 and thus makes their magnitudes easier to
interpret. The $i$th standardized residual is

$$\frac{Y_i - \hat Y_i}{s\sqrt{1 - p_{ii}}}$$

where $p_{ii}$ is the $i$th diagonal element of $P$.
A further property of the residuals is given by the following theorem.


THEOREM A
If the errors have the covariance matrix $\sigma^2 I$, the residuals are uncorrelated with
the fitted values.


Proof

The residuals are
$$\hat e = (I - P)Y$$
and the fitted values are
$$\hat Y = PY$$
From Theorem D of Section 14.4.1, the cross-covariance matrix of $\hat e$ and $\hat Y$ is

$$\Sigma_{\hat e\hat Y} = (I - P)(\sigma^2 I)P^T = \sigma^2(P^T - PP^T) = 0$$

This result follows from Lemma A of Section 14.4.3. ■


In Section 14.2.2, we considered plotting residuals versus fitted values (see Figure
14.9). According to this theorem, there should be no linear relationship in such a plot.


14.4.5 Inference about $\beta$


In this section, we continue the discussion of the statistical properties of the least
squares estimate $\hat\beta$. In addition to the assumptions made previously, we will assume
that the errors, $e_i$, are independent and normally distributed. Because the components
of $\hat\beta$ are in this case linear combinations of independent normally distributed random
variables, they are also normally distributed.

In particular, each component $\hat\beta_i$ of $\hat\beta$ is normally distributed with mean $\beta_i$
and variance $\sigma^2c_{ii}$, where $C = (X^TX)^{-1}$. The standard error of $\hat\beta_i$ may thus be
estimated as

$$s_{\hat\beta_i} = s\sqrt{c_{ii}}$$

This result will be used to construct confidence intervals and hypothesis tests that
will be exact under the assumption of normality and approximate otherwise (because
$\hat\beta_i$ may be expressed as a linear combination of the independent random variables
$e_i$, a version of the central limit theorem with certain assumptions on $X$ implies the
approximate result).
Under the normality assumption, it can be shown that

$$\frac{\hat\beta_i - \beta_i}{s_{\hat\beta_i}} \sim t_{n-p}$$

although we will not derive this result. It follows that a $100(1 - \alpha)\%$ confidence
interval for $\beta_i$ is

$$\hat\beta_i \pm t_{n-p}(\alpha/2)\,s_{\hat\beta_i}$$

To test the null hypothesis $H_0$: $\beta_i = \beta_{i0}$, where $\beta_{i0}$ is a fixed number, we can use the
test statistic

$$t = \frac{\hat\beta_i - \beta_{i0}}{s_{\hat\beta_i}}$$

Under $H_0$, this statistic follows a $t$ distribution with $n - p$ degrees of freedom. The
most commonly tested null hypothesis is $H_0$: $\beta_i = 0$, which states that $x_i$ has no
predictive value.
We will illustrate these concepts in the context of polynomial regression.


EXAMPLE A  Peak Area

Let us return to Example A in Section 14.2.2 concerning the regression of peak area
on percentage of FD&C Yellow No. 5. We have seen from the residual plot in Figure
14.5 that a straight line appears to give a reasonable fit. Consider enlarging the model
so that it is quadratic:

$$y = \beta_0 + \beta_1x + \beta_2x^2$$

where $y$ is peak area and $x$ is percentage of Yellow No. 5. The following table gives
the statistics of the fit:

Coefficient    Estimate    Standard Error    t Value
$\beta_0$          .058         .054           1.07
$\beta_1$        11.17         1.20           9.33
$\beta_2$        -1.90         5.53           -.35

To test the hypothesis $H_0$: $\beta_2 = 0$, we would use $-.35$ as the value of the $t$ statistic,
which would not reject $H_0$. Thus, this test, like the residual analysis, gives no evidence
that a quadratic term is needed. ■
EXAMPLE B
In Example B in Section 14.2.2, we saw that a residual plot for the linear regression
of stream flow rate on depth indicated the inadequacy of that model. The statistics for
a quadratic model are given in the following table:

Coefficient    Estimate    Standard Error    t Value
$\beta_0$          1.68         1.06           1.59
$\beta_1$        -10.86         4.52          -2.40
$\beta_2$         23.54         4.27           5.51




Here, the linear and quadratic terms are both statistically significant, and a residual 
plot, Figure 14.21, shows no signs of systematic misfit. 


[Figure 14.21: residuals (vertical axis) plotted against depth (horizontal axis).]

FIGURE 14.21 Residual plot from the quadratic regression of flow rate on stream
depth.


We have seen that the estimated covariance matrix of $\hat\beta$ is
$$\hat\Sigma_{\hat\beta\hat\beta} = s^2(X^TX)^{-1}$$
The corresponding correlation matrix for the coefficients is

$$\begin{bmatrix} 1.00 & -.99 & .97 \\ -.99 & 1.00 & -.99 \\ .97 & -.99 & 1.00 \end{bmatrix}$$

(Note that the correlation matrix does not depend on $s$ and is therefore completely
determined by $X$.) The correlation matrix shows that fluctuations in the components of
$\hat\beta$ are strongly interrelated. The linear coefficient, $\hat\beta_1$, is negatively correlated with both
the constant and quadratic coefficients, which in turn are positively correlated with
each other. This partly explains why the values of the constant and linear coefficients
change so much (they become $\hat\beta_0 = -3.98$ and $\hat\beta_1 = 13.83$) when the quadratic term
is absent from the model. ■


The estimated covariance matrix of $\hat\beta$ is useful for other purposes. Suppose that $x_0$
is a vector of predictor variables and that we wish to estimate the regression function
at $x_0$. The obvious estimate is

$$\hat\mu_0 = x_0^T\hat\beta$$
The variance of this estimate is
$$\operatorname{Var}(\hat\mu_0) = x_0^T\Sigma_{\hat\beta\hat\beta}x_0 = \sigma^2x_0^T(X^TX)^{-1}x_0$$

This variance can be estimated by substituting $s^2$ for $\sigma^2$, yielding a confidence interval
for $\mu_0$,

$$\hat\mu_0 \pm t_{n-p}(\alpha/2)\,s_{\hat\mu_0}$$

Note that $\operatorname{Var}(\hat\mu_0)$ depends on $x_0$. This dependency is explored further in Problem 13
of the end-of-chapter problems.


14.5 Multiple Linear Regression: An Example


This section gives a brief introduction to the subject of multiple regression. We will 
consider the statistical model 


$$y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_{p-1}x_{i,p-1} + e_i, \qquad i = 1, \ldots, n$$

As before, $\beta_0, \beta_1, \ldots, \beta_{p-1}$ are unknown parameters and the $e_i$ are independent
random variables with mean zero and variance $\sigma^2$. The $\beta_k$ have a simple interpretation:
$\beta_k$ is the change in the expected value of $y$ if $x_k$ is increased by one unit and the
other $x$'s are held fixed. Usually, the $x$'s are measurements on different variables,
but polynomial regression can be incorporated into this model by letting $x_{i2} = x_{i1}^2$,
$x_{i3} = x_{i1}^3$, and so on.

We will develop and illustrate several concepts by means of an example (Wein- 
dling 1977). Other examples are included in the end-of-chapter problems. Heart 
catheterization is sometimes performed on children with congenital heart defects. 
A Teflon tube (catheter) 3 mm in diameter is passed into a major vein or artery at the 
femoral region and pushed up into the heart to obtain information about the heart's 
physiology and functional ability. The length of the catheter is typically determined 
by a physician's educated guess. In a small study involving 12 children, the exact 
catheter length required was determined by using a fluoroscope to check that the tip 
of the catheter had reached the pulmonary artery. The patients' heights and weights 
were recorded. The objective was to see how accurately catheter length could be 
determined by these two variables. The data are given in the following table: 
Height    Weight    Distance to Pulmonary Artery
(in.)     (lb)      (cm)
42.8      40.0      37.0
63.5      93.5      49.5
37.5      35.5      34.5
39.5      30.0      36.0
45.5      52.0      43.0
38.5      17.0      28.0
43.0      38.5      37.0
22.5       8.5      20.0
37.0      33.0      33.5
23.5       9.5      30.5
33.0      21.0      38.5
58.0      79.0      47.0

Because this is a very small sample, any conclusions must be regarded as tentative.
Figure 14.22 presents scatterplots of all pairs of variables, providing a useful 
visual presentation of their relationships. We will refer to these plots as we proceed 


through the analysis. 


[Figure 14.22: scatterplot matrix; the panel axes are Height (in.), Weight (lb), and Distance (cm).]

FIGURE 14.22 Scatterplots showing all pairings of the variables height, weight, and
catheter length.
We first consider predicting the length by height alone and by weight alone. The
results of simple linear regressions are tabulated below:

                   Height           Weight
$\hat\beta_0$      12.1 (±4.3)      25.6 (±2.0)
$\hat\beta_1$       .60 (±.10)       .28 (±.04)
$s$                 4.0              3.8
$r^2$               .78              .80

The standard errors of $\hat\beta_0$ and $\hat\beta_1$ are given in parentheses. To test the null hypothesis
$H_0$: $\beta_1 = 0$, the appropriate test statistic is $t = \hat\beta_1/s_{\hat\beta_1}$. (These null hypotheses are
of no real interest in this problem, but we show the tests for pedagogical purposes.)
Clearly, this null hypothesis would be rejected in this case. The predictions from both
models are similar; the standard deviations of the residuals about the fitted lines are
4.0 and 3.8, respectively, and the squared correlation coefficients are .78 and .80.

The panels of Figure 14.23 are plots of the standardized residuals from each of 
the simple linear regressions versus the respective independent variable. The plot of 
residuals versus weight shows some hint of curvature, which is also apparent in the 
bottom middle scatterplot in Figure 14.22. The largest standardized residual from this 
fit comes from the lightest and shortest child (see eighth row of data table). 
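The simple regression on weight and its standardized residuals can be reproduced from the data table. The sketch below is one way to do so, not necessarily the computation used for the figure; the variable names are assumptions. It applies the definition of the standardized residual from Section 14.4.4, with $p_{ii}$ taken from the diagonal of the projection matrix, and locates the largest standardized residual.

```python
import numpy as np

# Weight (lb) and distance to pulmonary artery (cm) from the data table.
weight = np.array([40.0, 93.5, 35.5, 30.0, 52.0, 17.0, 38.5, 8.5, 33.0, 9.5, 21.0, 79.0])
length = np.array([37.0, 49.5, 34.5, 36.0, 43.0, 28.0, 37.0, 20.0, 33.5, 30.5, 38.5, 47.0])

n = len(length)
X = np.column_stack([np.ones(n), weight])            # straight-line design matrix
p = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ length)    # intercept and slope
P = X @ np.linalg.solve(X.T @ X, X.T)                # projection matrix
resid = length - X @ beta_hat
s = np.sqrt(resid @ resid / (n - p))                 # residual standard deviation
std_resid = resid / (s * np.sqrt(1.0 - np.diag(P)))  # standardized residuals
largest = int(np.argmax(np.abs(std_resid)))          # zero-based row index

print(beta_hat, s, largest + 1)                      # largest + 1 is the table row
```

The fitted intercept, slope, and $s$ round to the tabulated 25.6, .28, and 3.8, and the largest standardized residual in absolute value falls in the eighth row of the table.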

FIGURE 14.23 Standardized residuals from simple linear regressions of catheter
length plotted against the independent variables (a) height and (b) weight.

We next consider the multiple regression of length on height and weight together,
since perhaps better predictions may be obtained by using both variables rather than
either one alone. The method of least squares produces the following relationship:

Length = 21 (±8.8) + .20 (±.36) × height + .19 (±.17) × weight

where the standard errors of the coefficients are shown in parentheses. The standard
deviation of the residuals is 3.9.

The squared multiple correlation coefficient, or coefficient of determination,
is sometimes used as a crude measure of the strength of a relationship that has been
fit by least squares. This coefficient is simply defined as the squared correlation of
the dependent variable and the fitted values. It can be shown that the squared multiple
correlation coefficient, denoted by $R^2$, can be expressed as

$$R^2 = \frac{s_y^2 - s_{\hat e}^2}{s_y^2}$$

Since this is the ratio of the difference between the variance of the dependent variable
and the variance of the residuals from the fit to the variance of the dependent variable,
it can be interpreted as the proportion of the variability of the dependent variable that
can be explained by the independent variables. For the catheter example, $R^2 = .81$.
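The multiple regression summary can be recomputed from the data table. The sketch below is illustrative and its variable names are assumptions. It computes $\hat\beta$, $s$, the estimated standard errors $s\sqrt{c_{ii}}$, the $t$ values, and $R^2$ in two ways: as the squared correlation of $y$ with the fitted values, and as one minus the ratio of the residual to the total sum of squares. The second form matches the variance expression when both variances use the same divisor.

```python
import numpy as np

# Data from the catheter table: height (in.), weight (lb), distance (cm).
height = np.array([42.8, 63.5, 37.5, 39.5, 45.5, 38.5, 43.0, 22.5, 37.0, 23.5, 33.0, 58.0])
weight = np.array([40.0, 93.5, 35.5, 30.0, 52.0, 17.0, 38.5, 8.5, 33.0, 9.5, 21.0, 79.0])
length = np.array([37.0, 49.5, 34.5, 36.0, 43.0, 28.0, 37.0, 20.0, 33.5, 30.5, 38.5, 47.0])

n = len(length)
X = np.column_stack([np.ones(n), height, weight])
p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ length             # least squares coefficients
fitted = X @ beta_hat
resid = length - fitted
rss = resid @ resid
s = np.sqrt(rss / (n - p))                    # residual standard deviation
se = s * np.sqrt(np.diag(XtX_inv))            # standard errors s * sqrt(c_ii)
t_values = beta_hat / se                      # t statistics for H0: beta_i = 0

# R^2 two ways: squared correlation with fitted values, and 1 - RSS / total SS.
r2_corr = np.corrcoef(length, fitted)[0, 1] ** 2
r2_var = 1.0 - rss / np.sum((length - length.mean()) ** 2)

print(beta_hat, se, t_values, s, r2_corr, r2_var)
```

Up to rounding, the output agrees with the fitted equation, the residual standard deviation of 3.9, and $R^2 = .81$ quoted in the text; the $t$ values for height and weight are both small.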

Consider the coefficients and their standard errors shown above. It
may seem surprising that the standard errors of the coefficients of height and weight
are large relative to the coefficients themselves. Applying $t$ tests would not lead to
rejection of either of the hypotheses $H_1$: $\beta_1 = 0$ or $H_2$: $\beta_2 = 0$. Yet in the simple
linear regressions carried out above, the coefficients were highly significant. A partial
explanation of this is that the coefficients in the simple regressions and the coefficients
in the multiple regression have different interpretations. In the multiple regression,
$\beta_1$ is the change in the expected value of the catheter length if height is increased
by one unit and weight is held constant. It is the slope along the height axis of the
plane that describes the relation of length to height and weight; the large standard 
error indicates that this slope is not well resolved. To see why, consider the scatterplot 
of height versus weight in Figure 14.22. The method of least squares fits a plane to 
the catheter length values that correspond to the pairs of height and weight values 
in this plot. It should be intuitively clear from the figure that the slope of the fitted 
plane is relatively well resolved along the line about which the data points fall but 
poorly resolved along lines on which either height or weight is constant. Imagine 
how the fitted plane might move if values of length corresponding to pairs of height 
and weight values were perturbed. Variables that are strongly linearly related, such 
as height and weight in this example, are said to be highly collinear. If the values of 
height and weight had fallen exactly on a straight line, we would not have been able 
to determine a plane at all; in fact, X would not have had full column rank. 

The plot of height versus weight should also serve as a caution concerning making 
predictions from such a study. Obviously, we would not want to make a prediction 
for any pair of height and weight values quite dissimilar to those used in making the 
original fit. Any empirical relationship developed in the region in which the observed 
data fall might break down if it were extrapolated to a region in which no data had 
been observed. 

Little or no reduction in $s$ has been obtained by fitting simultaneously to height
and weight rather than fitting to either height or weight alone. (In fact, fitting to weight
alone gives a smaller value of $s$ than does fitting to height and weight together. This
may seem paradoxical, but recall that there are 10 degrees of freedom in the former
case and 9 in the latter, and that $s$ is the square root of the residual sum of squares
divided by the degrees of freedom.) Again, this is partially explained by Figure 14.22,
which shows that weight can be predicted quite well from height. Thus, it should not
seem surprising that adding weight to the equation for predicting from height produces
little gain.

FIGURE 14.24 Standardized residuals from the multiple regression of catheter
length on height and weight plotted against the independent variables (a) height and
(b) weight.

Finally, the panels of Figure 14.24 show the residuals from the multiple regression
plotted versus height and weight. The plots are very similar to those in Figure 14.23.

This simple example illustrates that the interpretation of regression coefficients 
is problematical, since the coefficient of a given variable depends on what other 
variables are included in the regression—that coefficient can change dramatically, 
and can even change in sign, as other variables are included or dropped from the 
model. Tukey and Mosteller (1977) give an example of the use of multiple regression 
in a study of influences on student achievement. The variables are 


$y$ = verbal achievement score of 6th graders
$x_1$ = staff salaries per pupil
$x_2$ = percentage of white collar fathers
$x_3$ = socioeconomic status
$x_4$ = teachers' average verbal scores
$x_5$ = mothers' average education (1 unit = 2 school years)

A multiple regression fit results in
$$y = 19.9 - 1.79x_1 + .0432x_2 + .556x_3 + 1.11x_4 - 1.79x_5$$


Is the policy implication that it is best to pay teachers small salaries and not educate 
mothers? Clearly, many of the predictors are highly correlated with each other and 
are also correlated with variables that are not in the model, and literal interpretation 
of a coefficient as being the effect if that variable is increased by one unit and the 
others are held fixed is fallacious. Also note that this is an observational study, not a 
controlled experiment. 


Conditional Inference, Unconditional 
Inference, and the Bootstrap 


The results in this chapter on the statistical properties of least squares estimates have 
been derived under the assumptions of a linear model relating independent variables 
X to dependent variables Y of the form 


$$Y = X\beta + e$$


In this formulation, the independent variables have been assumed to be fixed with 
randomness arising only through the errors e. This model seems appropriate for some 
experimental setups, such as that of Section 14.1, where fixed percentages of dyes, X,
were used and peak areas on a chromatograph, Y, were measured. However, consider 
Example B of Section 14.2.2, where the flow rate of a stream was related to its depth. 
The data consisted of measurements from 10 streams and it would seem to be rather 
forced to model the depths of those streams as being fixed and the flow rates as being 
random. In this section, we pursue the consequences of a model in which both X and 
Y are random, and we discuss the use of the bootstrap to quantify the uncertainty in 
parameter estimates under such a model. 

First we need to develop some notation. The design matrix will be denoted as a
random matrix $\Xi$ and a particular realization of this random matrix will be denoted, as
before, by $X$. The rows of $\Xi$ will be denoted by $\xi_1, \xi_2, \ldots, \xi_n$ and the rows of a
realization $X$ by $x_1, x_2, \ldots, x_n$. In place of the model $Y_i = x_i\beta + e_i$, where $x_i$ is fixed and
$e_i$ is random with mean 0 and variance $\sigma^2$, we will use the model $E(Y \mid \xi = x) = x\beta$
and $\operatorname{Var}(Y \mid \xi = x) = \sigma^2$. In the fixed X model, the $e_i$ were independent of each other.
In the random X model, $Y$ and $\xi$ have a joint distribution (for which the conditional
distribution of $Y$ given $\xi$ has mean and variance as specified before) and the data are
modeled as $n$ independent random vectors, $(Y_1, \xi_1), (Y_2, \xi_2), \ldots, (Y_n, \xi_n)$ drawn
from that joint distribution. The previous model is seen to be a conditional version of
the new model; the analysis is conditional on the observed values $x_1, x_2, \ldots, x_n$.

We will now deduce some of the consequences for least squares parameter
estimation under the new, unconditional, model. First, we have seen that in the old model,
the least squares estimate of $\beta$ is unbiased (Theorem A of Section 14.4.2); viewed
within the context of the new model we would express this result as $E(\hat\beta \mid \Xi = X) = \beta$.
We can use Theorem A of Section 4.4.1 to find $E(\hat\beta)$ under the new model:
$$E(\hat\beta) = E(E(\hat\beta \mid \Xi)) = E(\beta) = \beta$$
where the outer expectation is with respect to the distribution of $\Xi$. The least squares
estimate is thus unbiased under the new model as well.

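Next we consider the variance of the least squares estimate. From Theorem B of
Section 14.4.2, $\operatorname{Var}(\hat\beta_j \mid \Xi = X) = \sigma^2(X^TX)^{-1}_{jj}$. This is the conditional variance. To
find the unconditional variance we can use Theorem B of Section 4.4.1, according to
which

$$\operatorname{Var}(\hat\beta_j) = \operatorname{Var}(E(\hat\beta_j \mid \Xi)) + E(\operatorname{Var}(\hat\beta_j \mid \Xi)) = \operatorname{Var}(\beta_j) + E\left(\sigma^2(\Xi^T\Xi)^{-1}_{jj}\right) = \sigma^2E\left[(\Xi^T\Xi)^{-1}_{jj}\right]$$

The first term vanishes because $\beta_j$ is a constant. The result is a highly nonlinear
function of the random vectors $\xi_1, \xi_2, \ldots, \xi_n$ and would
generally be difficult to evaluate analytically.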

Thus for the new, unconditional model, the least squares estimates are still unbiased,
but their variances (and covariances) are different. Surprisingly, it turns out that
the confidence intervals we have developed still hold at their nominal levels of coverage.
Let $C(X)$ denote the $100(1 - \alpha)\%$ confidence interval for $\beta_j$ that we developed
under the old model. Using $I_A$ to denote the indicator variable of the event $A$, we can
express the fact that this is a $100(1 - \alpha)\%$ confidence interval as

$$E\big(I_{\{\beta_j \in C(X)\}} \mid \Xi = X\big) = 1 - \alpha$$

that is, the conditional probability of coverage is $1 - \alpha$. Because the conditional
probability of coverage is the same for every value of $\Xi$, the unconditional probability
of coverage is also $1 - \alpha$:

$$E\big(I_{\{\beta_j \in C(\Xi)\}}\big) = E\Big(E\big(I_{\{\beta_j \in C(\Xi)\}} \mid \Xi\big)\Big) = E(1 - \alpha) = 1 - \alpha$$

This very useful result says that for forming confidence intervals we can use the old
fixed-X model and that the intervals we thus form have the correct coverage in the
new random-X model as well.
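The coverage claim can be illustrated numerically. The sketch below is an illustration under assumed settings, not the text's own computation: simple linear regression with n = 10, exponentially distributed regressor values redrawn in every replication, and normal errors. It forms the usual fixed-X 95% interval for the slope and records how often it covers the true slope. The critical value 2.306 is the upper 2.5% point of the t distribution with 8 degrees of freedom; the function names are hypothetical.

```python
import math
import random

T_975_DF8 = 2.306  # upper 2.5% point of the t distribution with n - 2 = 8 df


def slope_interval(x, y, t_crit):
    """Usual fixed-X confidence interval for the slope of a least squares line."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)      # estimated standard error of the slope
    return b1 - t_crit * se, b1 + t_crit * se


def coverage(reps=4000, n=10, beta0=1.0, beta1=2.0, sigma=1.0, seed=1):
    """Fraction of replications in which the interval covers beta1 (random X)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.expovariate(1.0) for _ in range(n)]     # random design, redrawn
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        lo, hi = slope_interval(x, y, T_975_DF8)
        hits += lo <= beta1 <= hi
    return hits / reps


estimated_coverage = coverage()
print(estimated_coverage)   # expected to be near the nominal level 0.95
```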

We complete this section by discussing how the bootstrap can be used to estimate
the variability of a parameter estimate under the new model according to which the
parameter estimate, say $\hat{\theta}$, is based on $n$ i.i.d. random vectors
$(Y_1, \xi_1), (Y_2, \xi_2), \ldots, (Y_n, \xi_n)$. Depending on the context, there are a variety of
parameters $\theta$ that might be of interest. For example, $\theta$ could be one of the regression
coefficients, $\beta_j$; $\theta$ could be $E(Y \mid \xi = x_0)$, the expected response at a fixed level $x_0$
of the independent variables (see Problem 13); in simple linear regression, $\theta$ could be
that value $x_0$ such that $E(Y \mid \xi = x_0) = \mu_0$ for some fixed $\mu_0$; in simple linear
regression, $\theta$ could be the correlation coefficient of $Y$ and $\xi$. Now if we knew the
probability distribution of the random vector $(Y, \xi)$, we could simulate the sampling
distribution of the parameter estimate in the following way: On the computer draw $B$
(a large number) $n$-tuples, $(Y_1, \xi_1), (Y_2, \xi_2), \ldots, (Y_n, \xi_n)$, from that distribution and
for each draw compute the parameter estimate $\hat{\theta}$. This would yield
$\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_B$, and the empirical distribution of this collection would be an
approximation to the sampling distribution of $\hat{\theta}$. In particular, the standard deviation
of this collection would be an approximation to the standard error of $\hat{\theta}$.

This procedure is, of course, predicated on knowing the distribution of the
random vector $(Y, \xi)$, which is unlikely to be the case in practice. The bootstrap
principle says to approximate this unknown distribution by the observed empirical
distribution of $(Y_1, x_1), (Y_2, x_2), \ldots, (Y_n, x_n)$; that is, draw $B$ samples of size $n$
with replacement from $(Y_1, x_1), (Y_2, x_2), \ldots, (Y_n, x_n)$. For example, to approximate
the sampling distribution of a correlation coefficient, $r$, computed from $n$ pairs
$(Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n)$, one would draw $B$ samples each of size $n$ with
replacement from these pairs and from each sample one would compute the correlation
coefficient, yielding $r_1^*, r_2^*, \ldots, r_B^*$. The standard deviation of these would then be
used as an estimate of the standard error of $r$.
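The resampling recipe just described can be sketched in a few lines. The code below is an illustrative sketch rather than a definitive implementation: the data are simulated here only so that the example is self-contained, and the function names, sample size, and number of bootstrap samples are assumptions.

```python
import math
import random


def correlation(pairs):
    """Sample correlation coefficient of a list of (y, x) pairs."""
    n = len(pairs)
    my = sum(y for y, _ in pairs) / n
    mx = sum(x for _, x in pairs) / n
    syy = sum((y - my) ** 2 for y, _ in pairs)
    sxx = sum((x - mx) ** 2 for _, x in pairs)
    sxy = sum((y - my) * (x - mx) for y, x in pairs)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_se(pairs, statistic, B=1000, seed=0):
    """Standard deviation of `statistic` over B samples of size n drawn
    with replacement from the observed pairs."""
    rng = random.Random(seed)
    n = len(pairs)
    replicates = []
    for _ in range(B):
        resample = [pairs[rng.randrange(n)] for _ in range(n)]   # whole pairs are resampled
        replicates.append(statistic(resample))
    mean = sum(replicates) / B
    return math.sqrt(sum((r - mean) ** 2 for r in replicates) / (B - 1))


# Hypothetical data: 50 pairs with a positive population correlation.
rng = random.Random(42)
pairs = []
for _ in range(50):
    x = rng.gauss(0.0, 1.0)
    pairs.append((x + rng.gauss(0.0, 1.0), x))    # (Y_i, X_i)

r = correlation(pairs)
se_r = bootstrap_se(pairs, correlation, B=1000)
print(f"r = {r:.3f}, bootstrap SE = {se_r:.3f}")
```

Resampling whole pairs, rather than the responses alone, is what makes this appropriate for the random-X model: each bootstrap sample has its own set of x values.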


Local Linear Smoothing 


We motivate the material in this section with an example. Recapitulating the material 
in Example B of Section 10.7, recall that an inductive loop detector is a wire loop 
embedded in the pavement of a roadway. From the output of the detector, the number
of passing vehicles (flow) and the percentage of time that the detector was covered
by a vehicle (occupancy) are reported to a traffic management center. If a detector is
faulty or not operating at all, it may be desirable to estimate its flow and occupancy 
from flow and occupancy in other lanes. Such estimates might be used in summaries 
of traffic patterns, for example. 

Figure 14.25 is a plot of occupancy in lane 3 of a particular freeway location 
versus occupancy in lane 1. Lane 1 is the leftmost lane and lane 3 is the rightmost. 
The two are clearly strongly related, but the relationship is not linear. The dashed line 
is the line occupancy 3 = occupancy 1, and the solid line is the regression 
line. The data depart systematically from both of these relationships. It is interesting 
that at low occupancies, the values of lane 3 tend to be larger than those of lane 1. As 
occupancies increase, those in lane 1 are larger, except for very high occupancy, in
which case they are about equal. These very high occupancies correspond to extreme 
congestion in which the traffic conditions in the two lanes are very similar. 

Now suppose we want to estimate the expected occupancy in lane 3 given the 
occupancy in lane 1, for example, to use the values from lane 1 to estimate those 
for lane 3 when the latter are missing due to detector malfunction. First, observe 
that although the relationship is clearly not globally linear, it is locally linear—over 
a small range of lane 1 values, the relationship between lane 3 and lane 1 is nearly
linear, as is shown in Figure 14.26. 

To conform with generic notation, let $x$ and $y$ denote occupancies in lanes 1 and
3, respectively, and suppose that we want to estimate the value of $y$ corresponding to
a value $x_0$. Local linearity suggests that we choose a “bandwidth” $h$ (e.g., $h = .05$)
and fit a linear relationship between $y$ and $x$ over the range $x_0 - h < x < x_0 + h$.

FIGURE 14.25 Occupancy in lane 3 versus that in lane 1. The dashed line is the
line y = x, and the solid line is the least squares fit.

FIGURE 14.26 Linear relationships fit over the ranges 0 < x < .1, .1 < x < .2,
.2 < x < .3, .3 < x < .4, and .4 < x < .5.


This would amount to finding $\beta_0$ and $\beta_1$ to minimize

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \, w_h(x_i - x_0)$$

where the weight function $w_h(u)$ equals 1 for $-h < u < h$ and 0 elsewhere. The
fitted value corresponding to $x_0$ would then be $\hat{\beta}_0 + \hat{\beta}_1 x_0$. For example, if $x_0 = .25$
and $h = .05$, the fitted line is that shown over the region $.2 < x < .3$ in Figure 14.26,
and the fitted value at $x_0 = .25$ is the height of the regression line at that point.

FIGURE 14.27 Local linear estimates using a Gaussian weight function with increasingly
large bandwidths.

The weight function $w_h(u)$ is rectangular: it gives equal weight to all $(y_i, x_i)$
pairs for which $x_0 - h < x_i < x_0 + h$. Rather than weight all pairs in this
neighborhood equally, it is preferable to use weights that decay away from $x_0$. This can be
accomplished by letting $w_h(u)$ be a probability density function with mean zero and
standard deviation $h$, for example, a Gaussian density. The estimate can be computed
on a dense grid of values of $x_0$, using at each point of the grid the weight function
$w_h(x_i - x_0)$, which centers the density at $x_0$.
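The weighted least squares fit at a single point has a closed form, and evaluating it over a grid gives the smoothed curve. The following sketch assumes a Gaussian weight function with standard deviation h; the function names and the synthetic x and y values are hypothetical and stand in for the occupancy data, which are not reproduced here.

```python
import math


def gaussian_weight(u, h):
    """Gaussian density with mean 0 and standard deviation h, evaluated at u."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def rectangular_weight(u, h):
    """Weight 1 inside the window -h < u < h and 0 elsewhere."""
    return 1.0 if -h < u < h else 0.0


def local_linear(x0, xs, ys, h, weight=gaussian_weight):
    """Fitted value at x0 of the line minimizing
    sum_i (y_i - b0 - b1 * x_i)^2 * weight(x_i - x0, h).
    Assumes at least two distinct x values receive positive weight."""
    w = [weight(x - x0, h) for x in xs]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw      # weighted means
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    b1 = sxy / sxx                                       # weighted least squares slope
    b0 = ybar - b1 * xbar                                # weighted least squares intercept
    return b0 + b1 * x0


xs = [i / 100 for i in range(51)]            # hypothetical x values on [0, 0.5]
line = [0.1 + 0.8 * x for x in xs]           # an exactly linear response
curve = [x * (1.0 - x) for x in xs]          # a smooth nonlinear response

fit_line = local_linear(0.25, xs, line, h=0.05)      # reproduces the line at 0.25
fit_curve = local_linear(0.25, xs, curve, h=0.02)    # close to 0.25 * 0.75
grid_fit = [local_linear(x0, xs, curve, h=0.02) for x0 in xs]   # smoothed curve on a grid
```

Passing `weight=rectangular_weight` gives the equal-weight window fit described first; the Gaussian choice down-weights points far from the target value instead of cutting them off abruptly.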

Results for four choices of the bandwidth $h$ are shown in Figure 14.27. Notice
that for the small value $h = .01$, the smoothed curve is quite wiggly, because for such
a small bandwidth few points contribute to the fit. In contrast, the large value $h = .25$
produces a very smooth curve, but one that oversmooths and fails to track local trends.
The intermediate value $h = .025$ appears to do best at tracking the local trend. Also
note that the curves are continuous, since $S(\beta_0, \beta_1)$ is a continuous function of $x_0$.

Notice that the values of occupancy are not uniformly distributed but are much
more dense for small values of occupancy than for large values. Thus a region of width
$h = .025$ centered on a low value of occupancy contains many more points than a
region centered on a high value of occupancy. The smoothing is thus, in a sense, not
uniform in occupancy. A common alternative to smoothing with a fixed bandwidth
is to let the bandwidth depend on occupancy in such a way that a constant fraction,
$f$, of points are contained within a bandwidth. Thus, if the fraction is $f = 0.10$,
for example, the bandwidth, $h(x_0)$, corresponding to a value $x_0$ of the independent
variable is such that 10% of the $x$ values are in the interval $x_0 \pm h(x_0)$.
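A bandwidth of this kind can be computed from the sorted distances to the target point. The sketch below is one possible reading of the rule, with a hypothetical function name: it returns the smallest half-width whose window around the target holds at least the requested fraction of the x values. The x values are synthetic and deliberately denser near zero.

```python
import math


def fraction_bandwidth(x0, xs, f):
    """Smallest h such that at least a fraction f of the xs lie in [x0 - h, x0 + h]."""
    n = len(xs)
    k = max(1, math.ceil(f * n - 1e-9))      # points the window must hold (guarding round-off)
    distances = sorted(abs(x - x0) for x in xs)
    return distances[k - 1]                  # distance to the k-th nearest x value


xs = [(i / 100) ** 2 for i in range(101)]    # hypothetical x values, dense near 0
h_low = fraction_bandwidth(0.05, xs, f=0.10)     # narrow window where points are dense
h_high = fraction_bandwidth(0.90, xs, f=0.10)    # wide window where points are sparse
print(h_low, h_high)
```

The window automatically narrows where the data are dense and widens where they are sparse, so each local fit rests on about the same number of observations.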

The bandwidth can often be reasonably chosen by visual examination of the
smoothed scatter plots, as in Figure 14.27. In some circumstances, though, it is
desirable to select the bandwidth automatically from the data. Cross-validation is a
commonly used procedure; for choosing a fraction $f$ the algorithm is as follows:

1. Specify a sequence of possible values of $f$: $f_1, f_2, \ldots, f_M$.

2. For each $k = 1, 2, \ldots, M$:
   For $i = 1, 2, \ldots, n$, leave out the data point $(y_i, x_i)$, smooth the rest of the data
   using the bandwidth $f_k$, and use the result to predict $y_i$. Denote the predicted
   value by $\hat{y}_{(i)}$.
   Compute the cross-validation score, $CV(f_k) = \sum_{i=1}^{n} (\hat{y}_{(i)} - y_i)^2$.

3. Select the bandwidth that minimizes $CV(f)$.
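The leave-one-out procedure above can be sketched as follows. This is an illustration under stated assumptions, not the computation behind the figures in the text: the score is taken to be the sum of squared prediction errors, the smoother is a Gaussian-weighted local linear fit whose standard deviation is the distance to the nearest fraction f of the remaining points, and the data are simulated. All function names are hypothetical.

```python
import math
import random


def fraction_bandwidth(x0, xs, f):
    """Distance from x0 to the k-th nearest x, where k is about f * len(xs) (at least 2)."""
    k = max(2, math.ceil(f * len(xs) - 1e-9))
    return sorted(abs(x - x0) for x in xs)[k - 1]


def local_linear(x0, xs, ys, h):
    """Gaussian-weighted local linear fit at x0; the density's constant factor cancels."""
    w = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    return ybar + (sxy / sxx) * (x0 - xbar)


def cv_score(xs, ys, f):
    """Sum over i of the squared error in predicting y_i from the other n - 1 points."""
    total = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]         # leave out the i-th data point
        rest_y = ys[:i] + ys[i + 1:]
        h = fraction_bandwidth(xs[i], rest_x, f)
        prediction = local_linear(xs[i], rest_x, rest_y, h)
        total += (prediction - ys[i]) ** 2
    return total


def choose_fraction(xs, ys, candidates):
    """Return the candidate fraction with the smallest score, and all the scores."""
    scores = {f: cv_score(xs, ys, f) for f in candidates}
    return min(scores, key=scores.get), scores


# Hypothetical data: a noisy sine curve observed at 100 random x values.
rng = random.Random(3)
xs = sorted(rng.random() for _ in range(100))
ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
candidates = [0.05, 0.1, 0.2, 0.4, 0.8]
best_f, scores = choose_fraction(xs, ys, candidates)
print(best_f, scores)
```

Leaving each point out before predicting it is what keeps the smallest fraction from winning automatically: a very wiggly fit tracks the remaining points closely but predicts the omitted one poorly.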


Figure 14.28 shows the results of cross-validated choice of a smoothing fraction 
f and a Gaussian weight function. The left panel shows the cross-validation score. It 
is high for very small values of f, since the smoothed curve is very wiggly because 
locally it depends on a small number of observations. The score is also high for large 
values of f, which lead to over-smoothing. The minimizing fraction is f = 0.28 and
the corresponding smoothed curve shown in the right panel of the figure apparently
does a good job at estimating the local trend.

FIGURE 14.28 Left panel: cross-validation score as a function of f. Right panel: local
linear fit for the value of the minimizer of f, f = 0.28.


Concluding Remarks 


We have developed theory and techniques only for linear least squares problems. If the 
unknown parameters enter into the prediction equation nonlinearly, the minimization 
cannot typically be done in closed form, and iterative methods are necessary. Also, 
expressions in closed form for the standard errors of the coefficients cannot usually 
be obtained; linearization is often used to obtain approximate standard errors. The 
bootstrap can also be used. 

As has been mentioned, least squares estimates are not robust against outliers.
There are robust methods for regression. The discussion of M estimates in Chapter 10
suggests minimizing

$$\sum_{i=1}^{n} \Psi(y_i - \hat{y}_i)$$

for a robust weight function, $\Psi$. Note that the least squares estimate corresponds to
$\Psi(x) = x^2$. The choice $\Psi(x) = |x|$ gives the curve-fitting analogue of the median.
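The contrast between the squared-error and absolute-error criteria can be seen on a small example. The sketch below is illustrative only. It fits a line by least squares and by least absolute deviations to made-up data lying exactly on a line except for one gross outlier. The absolute-deviation fit uses a brute-force search that relies on the fact that some minimizing line passes through two of the data points; this is practical only for small data sets, and the function names are hypothetical.

```python
from itertools import combinations


def least_squares_line(xs, ys):
    """Intercept and slope minimizing the sum of squared residuals."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - b1 * xbar, b1


def least_absolute_line(xs, ys):
    """Intercept and slope minimizing the sum of absolute residuals,
    found by trying every line through two data points."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue                          # vertical line; not a candidate
        b1 = (ys[j] - ys[i]) / (xs[j] - xs[i])
        b0 = ys[i] - b1 * xs[i]
        loss = sum(abs(y - b0 - b1 * x) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]


xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x for x in xs]              # points on the line y = 1 + 2x
ys[5] = 40.0                                  # one gross outlier (the line gives 11)

ls_b0, ls_b1 = least_squares_line(xs, ys)     # pulled toward the outlier
lad_b0, lad_b1 = least_absolute_line(xs, ys)  # stays on the line through the other points
print((ls_b0, ls_b1), (lad_b0, lad_b1))
```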

In some applications, a large number of independent variables are candidates for 
inclusion in the prediction equation. Various techniques for variable selection have 
been proposed, and research is still active in this area. 

In simple linear regression, points with x values at the extremes of the data exert a 
large influence on the fitted line. In multiple regression, a similar phenomenon occurs,
but it is usually not so easily detected. For this reason, several measures of "influence"
have been proposed. Good software packages routinely flag influential observations. 

The problem of errors introduced via calibration of instruments has not been 
fully discussed. Suppose, for example, that an instrument for measuring temperature 
is to be calibrated. Readings are taken at several known temperatures (the indepen- 
dent variables) and a functional relationship between the instrument readings (the 
dependent variable) and the temperatures is fit by the method of least squares. After 
this has been carried out, an unknown temperature is read by the instrument and 
is predicted using the fitted relationship. How do the errors in the estimates of the 
coefficients of the functional form propagate? That is, what is the uncertainty of the 
estimated temperature? This is an inverse problem, and its analysis is not completely 
straightforward. 


Problems 


1. Convert the following relationships into linear relationships by making transfor- 
mations and defining new variables. 


a. y = a/(b + cx)
d. y = x/(a + bx)
e. y= 1/(1 + e^) 


2. Plot y versus x for the following pairs: 


x |  .34   1.38   -.65    .68   1.40   -.88   -.30   -1.8    .50   -1.75
y |  .27   1.34   -.53    .35   1.28   -.98   -.72   -.81    .64   -1.59


a. Fit a line y = a + bx by the method of least squares, and sketch it on the plot. 
b. Fit a line x = c + dy by the method of least squares, and sketch it on the plot. 
c. Are the lines in parts (a) and (b) the same? If not, why not? 


3. Suppose that $y_i = \mu + e_i$, where $i = 1, \ldots, n$ and the $e_i$ are independent errors
with mean zero and variance $\sigma^2$. Show that $\bar{y}$ is the least squares estimate of $\mu$.


4. Consider a standard linear regression model in which the freshman GPA is modeled
to depend linearly on high school GPA: $Y_i = \beta_0 + \beta_1 x_i + e_i$, $i = 1, 2, \ldots, n$.
Suppose that different intercepts were to be allowed for females and males, and
write the model as

$$Y_i = I_F(i)\beta_F + I_M(i)\beta_M + \beta_1 x_i + e_i$$

where $I_F(i)$ and $I_M(i)$ are indicator variables taking on values 0 and 1 according
to whether the gender of the $i$th person is female or male. Give the form of the
design matrix for such a model.


5. Three objects are located on a line at points $p_1 < p_2 < p_3$. These locations are
not precisely known. A surveyor makes the following measurements:

a. He stands at the origin and measures the three distances from there to $p_1$, $p_2$,
$p_3$. Let these measurements be denoted by $Y_1$, $Y_2$, $Y_3$.

b. He goes to $p_1$ and measures the distances from there to $p_2$ and $p_3$. Let these
measurements be denoted by $Y_4$, $Y_5$.

c. He goes to $p_2$ and measures the distance from there to $p_3$. Denote this
measurement by $Y_6$.

He thus makes six measurements in all, and they are all subject to error. In order
to estimate the values $p_1$, $p_2$, $p_3$, he decides to combine all the measurements
by the method of least squares. Using matrix notation, explain clearly how the
least squares estimates would be calculated (you don't have to do the actual
calculations).


6. Two objects of unknown weights $w_1$ and $w_2$ are weighed on an error-prone
pan balance in the following way: (1) object 1 is weighed by itself, and the
measurement is 3 g; (2) object 2 is weighed by itself, and the result is 3 g; (3)
the difference of the weights (the weight of object 1 minus the weight of object
2) is measured by placing the objects in different pans, and the result is 1 g; (4)
the sum of the weights is measured as 7 g. The problem is to estimate the true
weights of the objects from these measurements.

a. Set up a linear model, $Y = X\beta + e$. (Hint: The entries of $X$ are 0 and $\pm 1$.)
b. Find the least squares estimates of $w_1$ and $w_2$.
c. Find the estimate of $\sigma^2$.
d. Find the estimated standard errors of the least squares estimates of part (b).
e. Estimate $w_1 - w_2$ and its standard error.
f. Test the null hypothesis $H_0$: $w_1 = w_2$.


7. (Weighted Least Squares) Suppose that in the model $y_i = \beta_0 + \beta_1 x_i + e_i$, the
errors have mean zero and are independent, but $\mathrm{Var}(e_i) = \rho_i^2 \sigma^2$, where the $\rho_i$
are known constants, so the errors do not have equal variance. This situation
arises when the $y_i$ are averages of several observations at $x_i$; in this case, if $y_i$
is an average of $n_i$ independent observations, $\rho_i^2 = 1/n_i$ (why?). Because the
variances are not equal, the theory developed in this chapter does not apply;
intuitively, it seems that the observations with large variability should influence
the estimates of $\beta_0$ and $\beta_1$ less than the observations with small variability.

The problem may be transformed as follows:

$$\rho_i^{-1} y_i = \rho_i^{-1} \beta_0 + \rho_i^{-1} \beta_1 x_i + \rho_i^{-1} e_i$$

or

$$z_i = u_i \beta_0 + v_i \beta_1 + \delta_i$$

where

$$u_i = \rho_i^{-1} \qquad v_i = \rho_i^{-1} x_i \qquad \delta_i = \rho_i^{-1} e_i$$

a. Show that the new model satisfies the assumptions of the standard statistical
model.
b. Find the least squares estimates of $\beta_0$ and $\beta_1$.
c. Show that performing a least squares analysis on the new model, as was done
in part (b), is equivalent to minimizing

$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \rho_i^{-2}$$

This is a weighted least squares criterion; the observations with large variances
are weighted less.
d. Find the variances of the estimates of part (b).


8. (The QR Method) This problem outlines the basic ideas of an alternative method,
the QR method, of finding the least squares estimate $\hat{\beta}$. An advantage of the
method is that it does not include forming the matrix $X^T X$, a process that tends
to increase rounding error. The essential ingredient of the method is that if $X_{n \times p}$
has $p$ linearly independent columns, it may be factored in the form

$$\underset{n \times p}{X} = \underset{n \times p}{Q} \; \underset{p \times p}{R}$$

where the columns of $Q$ are orthogonal ($Q^T Q = I$) and $R$ is upper-triangular
($r_{ij} = 0$, for $i > j$) and nonsingular. [For a discussion of this decomposition and
its relationship to the Gram-Schmidt process, see Strang (1980).]

Show that $\hat{\beta} = (X^T X)^{-1} X^T Y$ may also be expressed as $\hat{\beta} = R^{-1} Q^T Y$,
or $R\hat{\beta} = Q^T Y$. Indicate how this last equation may be solved for $\hat{\beta}$ by
back-substitution, using that $R$ is upper-triangular, and show that it is thus unnecessary
to invert $R$.
9. (Cholesky Decomposition) This problem outlines the basic ideas of a popular and
effective method of computing least squares estimates. Assuming that its inverse
exists, $X^T X$ is a positive definite matrix and may be factored as $X^T X = R^T R$,
where $R$ is an upper-triangular matrix. This factorization is called the Cholesky
decomposition. Show that the least squares estimates can be found by solving
the equations

$$R^T v = X^T Y$$

$$R \hat{\beta} = v$$

where $v$ is appropriately defined. Show that these equations can be solved by
back-substitution because $R$ is upper-triangular, and that therefore it is not
necessary to carry out any matrix inversions explicitly to find the least squares
estimates.


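10. Show that the least squares estimates of the slope and intercept of a line may be
expressed as

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

and

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$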


11. Show that if $\bar{x} = 0$, the estimated slope and intercept are uncorrelated under the
assumptions of the standard statistical model.


12. Use the result of Problem 10 to show that the line fit by the method of least
squares passes through the point $(\bar{x}, \bar{y})$.


13. Suppose that a line is fit by the method of least squares to $n$ points, that the
standard statistical model holds, and that we want to estimate the line at a new
point, $x_0$. Denoting the value on the line by $\mu_0$, the estimate is

$$\hat{\mu}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$$

a. Derive an expression for the variance of $\hat{\mu}_0$.

b. Sketch the standard deviation of $\hat{\mu}_0$ as a function of $x_0 - \bar{x}$. The shape of the
curve should be intuitively plausible.

c. Derive a 95% confidence interval for $\mu_0 = \beta_0 + \beta_1 x_0$ under an assumption of
normality.


14. Problem 13 dealt with how to form a confidence interval for the value of a line at a
point $x_0$. Suppose that instead we want to predict the value of a new observation,
$Y_0$, at $x_0$,

$$Y_0 = \beta_0 + \beta_1 x_0 + e_0$$

by the estimate

$$\hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$$

a. Find an expression for the variance of $\hat{Y}_0 - Y_0$, and compare it to the expression
for the variance of $\hat{\mu}_0$ obtained in part (a) of Problem 13. Assume that $e_0$ is
independent of the original observations and has the variance $\sigma^2$.

b. Assuming that $e_0$ is normally distributed, find the distribution of $\hat{Y}_0 - Y_0$. Use
this result to find an interval $I$ such that $P(Y_0 \in I) = 1 - \alpha$. This interval is
called a $100(1 - \alpha)\%$ prediction interval.


15. Find the least squares estimate of $\beta$ for fitting the line $y = \beta x$ to points $(x_i, y_i)$,
where $i = 1, \ldots, n$.

16. Consider fitting the curve $y = \beta_0 x + \beta_1 x^2$ to points $(x_i, y_i)$, where $i = 1, \ldots, n$.

a. Use the matrix formalism to find expressions for the least squares estimates
of $\beta_0$ and $\beta_1$.
b. Find an expression for the covariance matrix of the estimates.


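17. This problem extends some of the material in Section 14.2.3. Let $X$ and $Y$ be
random variables with

$$E(X) = \mu_x \qquad E(Y) = \mu_y$$

$$\mathrm{Var}(X) = \sigma_x^2 \qquad \mathrm{Var}(Y) = \sigma_y^2$$

$$\mathrm{Cov}(X, Y) = \sigma_{xy}$$

Consider predicting $Y$ from $X$ as $\hat{Y} = \alpha + \beta X$, where $\alpha$ and $\beta$ are chosen to
minimize $E(Y - \hat{Y})^2$, the expected squared prediction error.

a. Show that the minimizing values of $\alpha$ and $\beta$ are

$$\alpha = \mu_y - \beta \mu_x \qquad \beta = \frac{\sigma_{xy}}{\sigma_x^2}$$

[Hint: $E(Y - \hat{Y})^2 = (EY - E\hat{Y})^2 + \mathrm{Var}(Y - \hat{Y})$.]
b. Show that for this choice of $\alpha$ and $\beta$

$$\frac{\mathrm{Var}(Y) - \mathrm{Var}(Y - \hat{Y})}{\mathrm{Var}(Y)} = r_{xy}^2$$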





18. Suppose that

$$Y_i = \beta_0 + \beta_1 x_i + e_i, \qquad i = 1, \ldots, n$$

where the $e_i$ are independent and normally distributed with mean zero and variance
$\sigma^2$. Find the mle's of $\beta_0$ and $\beta_1$ and verify that they are the least squares
estimates. (Hint: Under these assumptions, the $Y_i$ are independent and normally
distributed with means $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$. Write the joint density function
of the $Y_i$ and thus the likelihood.)


19. a. Show that the vector of residuals is orthogonal to every column of X.
b. Use this result to show that the residuals sum to zero and thus the sum has 
expectation zero if the model contains an intercept term. 


20. Assume that the columns of $X$, $X_1, \ldots, X_p$, are orthogonal; that is, $X_i^T X_j = 0$,
for $i \neq j$. Show that the covariance matrix of the least squares estimates is
diagonal.

21. Suppose that $n$ points $x_1, \ldots, x_n$ are to be placed in the interval $[-1, 1]$ for fitting
the model

$$Y_i = \beta_0 + \beta_1 x_i + e_i$$

where the $e_i$ are independent with common variance $\sigma^2$. How should the $x_i$ be
chosen in order to minimize $\mathrm{Var}(\hat{\beta}_1)$?


22. Suppose that the relation of family income to consumption is linear. Of those
families in the 90th percentile of income, what proportion would you expect to 
be at or above the 90th percentile of consumption: (a) exactly 50%, (b) less than 
50%, (c) more than 50%? Justify your answers. 


23. Suppose that grades on a midterm and a final have a correlation coefficient of .5
and both exams have an average score of 75 and a standard deviation of 10. 


a. If a student's score on the midterm is 95, what would you predict his score on 
the final to be? 

b. If a student scored 85 on the final, what would you guess that her score on the 
midterm was? 


24. Suppose that the independent variables in a least squares problem are replaced by
rescaled variables $u_{ij} = k_j x_{ij}$ (for example, centimeters are converted to meters).
Show that $\hat{Y}$ does not change. Does $\hat{\beta}$ change? (Hint: Express the new design
matrix in terms of the old one.)


25. Suppose that each setting $x_i$ of the independent variables in a simple least squares
problem is duplicated, yielding two independent observations $Y_{i1}$, $Y_{i2}$. Is it true
that the least squares estimates of the intercept and slope can be found by doing
a regression of the mean responses of each pair of duplicates, $\bar{Y}_i = (Y_{i1} + Y_{i2})/2$,
on the $x_i$? Why or why not?


26. Suppose that $Z_1, Z_2, Z_3, Z_4$ are random variables with $\mathrm{Var}(Z_i) = 1$ and
$\mathrm{Cov}(Z_i, Z_j) = \rho$ for $i \neq j$. Use the matrix techniques we developed in Section
14.4.1 to show that $Z_1 + Z_2 + Z_3 + Z_4$ is uncorrelated with $Z_1 + Z_2 - Z_3 - Z_4$.


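27. For the standard linear model of Section 14.4.2, show that

$$\sigma^2 I = \Sigma_{\hat{Y}\hat{Y}} + \Sigma_{\hat{e}\hat{e}}$$

Conclude that

$$n\sigma^2 = \sum_{i=1}^{n} \mathrm{Var}(\hat{Y}_i) + \sum_{i=1}^{n} \mathrm{Var}(\hat{e}_i)$$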
28. Suppose that $X_1, \ldots, X_n$ are independent with mean $\mu_i$ and common variance
$\sigma^2$. Let $Y = \sum_{i=1}^{n} a_i X_i$.
a. Let $Z = \sum_{i=1}^{n} b_i X_i$. Use Theorem D of Section 14.4.1 to find $\mathrm{Cov}(Y, Z)$.
b. Use Theorem C of Section 14.4.1 to find $E\left(\sum_{i=1}^{n} \sum_{j=1}^{n} X_i X_j\right)$.


29. Assume that $X_1$ and $X_2$ are uncorrelated random variables with variance $\sigma^2$, and
use matrix methods to show that $Y = X_1 + X_2$ and $Z = X_1 - X_2$ are uncorrelated.
(Hint: Find $\Sigma_{YZ}$.)


30. Let $X_1, \ldots, X_n$ be random variables with $\mathrm{Var}(X_i) = \sigma^2$ and
$\mathrm{Cov}(X_i, X_j) = \rho\sigma^2$, for $i \neq j$. Use matrix methods to find $\mathrm{Var}(\bar{X})$.


31. Let $Z$ be a random vector with 4 components and covariance matrix $\sigma^2 I$. Let
$U = Z_1 + Z_2 + Z_3 + Z_4$ and $V = (Z_1 + Z_2) - (Z_3 + Z_4)$. Use matrix methods
to find $\mathrm{Cov}(U, V)$.


32. Let $X$ be a random $n$-vector and let $Y$ be a random vector with $Y_1 = X_1$,
$Y_i = X_i - X_{i-1}$, $i = 2, \ldots, n$.

a. If the $X_i$ are independent random variables with variances $\sigma^2$, find the
covariance matrix of $Y$.

b. If the $Y_i$ are independent random variables with variances $\sigma^2$, find the
covariance matrix of $X$.


33. a. Let $X \sim N(0, 1)$ and $E \sim N(0, 1)$ be independent, and let $Y = X + \beta E$.
Show that

$$r_{XY} = \frac{1}{\sqrt{\beta^2 + 1}}$$

b. Use the results of part (a) to generate bivariate samples $(x_i, y_i)$ of size 20 with
population correlation coefficients -.9, -.5, 0, .5, and .9, and compute the
sample correlation coefficients.

c. Have a partner generate scatterplots as in part (b) and then guess the correlation
coefficients.


34. Generate a bivariate sample of size 50 as in Problem 33 with correlation coefficient
.8. Find the estimated regression line and the residuals. Plot the residuals
versus X and the residuals versus Y. Explain the appearance of the plots. 


35. An investigator wants to use multiple regression to predict a variable, $Y$, from two
other variables, $X_1$ and $X_2$. She proposes forming a new variable $X_3 = X_1 + X_2$
and using multiple regression to predict $Y$ from the three $X$ variables. Show that
she will run into problems because the design matrix will not have full rank.


36. The file bismuth contains the transition pressure (bar) of the bismuth II-I transition
as a function of temperature (°C) (see Example E in Section 14.2.2). Fit a
linear relationship between pressure and temperature, examine the residuals, and 
comment. 


37. Dissociation pressure for a reaction involving barium nitride was recorded as a
function of temperature (Orcutt 1970). The second law of thermodynamics gives
the approximate relationship

$$\ln(\text{pressure}) = A + \frac{B}{T}$$

where $T$ is absolute temperature. From the data in the file barium, estimate $A$
and $B$ and their standard errors. Form approximate 95% confidence intervals for
$A$ and $B$. Examine the residuals and comment.


38. The file sapphire lists observed values of Young's modulus (g) measured at
various temperatures (T) for sapphire rods (Ku 1969). Fit a linear relationship
$g = \beta_0 + \beta_1 t$, and form confidence intervals for the coefficients. Examine the
residuals.


39. As part of a nuclear safeguards program, the contents of a tank are routinely
measured. The determination of volume is made indirectly by measuring the dif- 
ference in pressure at the top and at the bottom of the tank. The tank is cylindrical 
in shape, but its internal geometry is complicated by various pipes and agitator 
paddles. Without these complications, pressure and volume should have a linear 
relationship. To calibrate pressure with respect to volume, known quantities (x) 
of liquid are placed in the tank and pressure readings (y) are taken. The data 
in the file tankvolume are from Knafl et al. (1984). The units of volume are 
kiloliters and those of pressure are pascals. 


a. Plot pressure versus volume. Does the relationship appear linear? 

b. Calculate the linear regression of pressure on volume, and plot the residuals 
versus volume. What does the residual plot show? 

c. Try fitting pressure as a quadratic function of volume. What do you think of 
the fit? 


40. The following data come from the calibration of a proving ring, a device for
measuring force (Hockersmith and Ku 1969). 


a. Plot load versus deflection. Does the plot look linear? 

b. Fit deflection as a linear function of load, and plot the residuals versus load. 
Do the residuals show any systematic lack of fit? 

c. Fit deflection as a quadratic function of load, and estimate the coefficients and 
their standard errors. Plot the residuals. Does the fit look reasonable? 





                       Deflection
Load        Run 1      Run 2      Run 3
10,000       68.32      68.35      68.30
20,000      136.78     136.68     136.80
30,000      204.98     205.02     204.98
40,000      273.85     273.85     273.80
50,000      342.70     342.63     342.63
60,000      411.30     411.35     411.28
70,000      480.65     480.60     480.63
80,000      549.85     549.85     549.83
90,000      619.00     619.02     619.10
100,000     688.70     688.62     688.58





41. The file chestnut contains the diameter (feet) at breast height (DBH) and the
age (years) of 27 chestnut trees (Chapman and Demeritt 1936). Try fitting DBH
as a linear function of age. Examine the residuals. Can you find a transformation 
of DBH and/or age that produces a more linear relationship? 


42. The stopping distance (y) of an automobile on a certain road was studied as a
function of velocity (Brownlee 1960). The data are listed in the following table.
Fit $y$ and $\sqrt{y}$ as linear functions of velocity, and examine the residuals in each
case. Which fit is better? Can you suggest any physical reason that explains why?

Velocity (mi/h)    Stopping Distance (ft)
20.5                15.4
20.5                13.3
30.5                33.9
40.5                73.1
48.8               113.0
57.8               142.6




43. Chang (1945) studied the rate of sedimentation of amoebic cysts in water, in
attempting to develop methods of water purification. The following table gives
the diameters of the cysts and the times required for the cysts to settle through
720 μm of still water at three temperatures. Each entry of the table is an average of
several observations, the number of which is given in parentheses. Does the time
required appear to be a linear or a quadratic function of diameter? Can you find
a model that fits? How do the settling rates at the three temperatures compare?
(See Problem 7.)

                 Settling Times of Cysts (sec)
Diameter (μm)    10°C           25°C           28°C
11.5             217.1 (2)      138.2 (1)      128.4 (2)
13.1             168.3 (3)      109.3 (3)      103.1 (4)
14.4             136.6 (11)      89.1 (13)      82.7 (11)
15.8             114.6 (17)      73.0 (11)      70.5 (18)
17.3              96.4 (8)       61.3 (6)       59.7 (6)
18.7              80.8 (5)       56.2 (4)       50.0 (4)
20.2              70.4 (2)       46.3 (1)       41.4 (2)








44. Cogswell (1973) studied a method of measuring resistance to breathing in children.
The file asthma lists respiratory resistance and height (cm) for children
with asthma and the file cystfibr contains results for children with cystic fi- 
brosis. Is there a statistically significant relation between respiratory resistance 
and height in either group? 


45. The file reading contains average reading scores of third-graders from several
elementary schools on a standardized test in each of two successive years. Is there 
a "regression effect"? 


46. Measurement of the concentration of small asbestos fibers is important in studies
of environmental health issues and in setting and enforcing appropriate regulations.
The concentrations of such fibers are measured most accurately by an electron
microscope, but for practical reasons, optical microscopes must sometimes
be used. Kiefer et al. (1987) compared measurements of asbestos fiber concentration
from 30 airborne samples by a scanning electron microscope (SEM) and by a
phase contrast microscope (PCM). The data are contained in the file asbestos.
Study the relationship between the two measurements, taking the more accurate
SEM measurements as the independent variable and the PCM measurements as
the dependent variable.


47. Aerial survey methods are used to estimate the number of snow geese in their
summer range areas west of Hudson's Bay in Canada. To obtain estimates, small 
aircraft fly over the range and, when a flock of snow geese is spotted, an ex- 
perienced observer estimates the number of geese in the flock. To investigate 
the reliability of this method, an experiment in which an airplane carried two 
observers flew over 45 flocks, and each observer independently estimated the 
number of geese in the flock. Also, a photograph of the flock was taken so that an 
exact count of the number in the flock could be obtained (Weisberg 1985). The 
data are contained in the file geese. 


a. Draw scatterplots of observer counts, Y , versus photo count, x. Do these graphs 
suggest that a simple linear regression model might be appropriate? 

b. Calculate the linear regressions. What are the residual standard errors, what do 
they mean, and how do they compare? Do the fitted regressions appear to be 
different? Plot residuals and absolute values of residuals versus photo counts. 
Do the residuals indicate any systematic misfit? Does the residual variation 
appear to be constant? 

c. Repeat the above using the square root transformation on the counts. Does 
this transformation stabilize the variance? 

d. You have now computed the fits in two ways. How do they compare? 

e. Write a few sentences in answer to the questions, “How well do observers
estimate the number of geese?” “How do the two observers compare?”


48. The volume, height, and diameter at 4.5 ft above ground level were measured for
a sample of 31 black cherry trees in the Allegheny National Forest in Pennsyl- 
vania. The data were collected to provide a basis for determining an easy way 
of estimating the volume of a tree. Develop a model relating volume to height 
and diameter. The columns of the data matrix are diameter, height, and volume, 
in that order (Ryan, Joiner, & Ryan 1976). The data are contained in the file 
treevolume. 


The file flow-occ contains data collected by loop detectors in all three lanes 
(see Section 14.7). Examine the relationship of flow in lane 3 versus that in lane 
1. Make a scatterplot and fit a regression line. Does the linear relationship look 
accurate or is there some systematic misfit? Fit local linear relationships with 
several bandwidths. Identify a bandwidth that is too small and one that is too 
large. What bandwidth appears to provide a good balance between being too 
wiggly and being over-smooth? 


The file binary59683 contains measurements of the light of an astronomical 
source as a function of time. Time is in units of days (Julian date), and bright- 
ness is measured as “magnitude.” According to this system of measurement, the 
brightest star has magnitude −1.4 and the faintest visible star has magnitude 6, 
so the larger the magnitude, the dimmer the light. 


a. Plot magnitude versus time. Do you see any structure? 

b. The object is actually an eclipsing binary system (two stars rotating around 
each other) with a period P = 0.407528 days. Define s = t mod P and plot 
magnitude versus s. Can you qualitatively explain the shape of the "lightcurve" 
you see? 

c. The lightcurve contains a lot of information about the nature of the binary 
system. Use local linear smoothing to estimate the underlying lightcurve. 

d. Change the period slightly and see how the lightcurve changes. Do this sev- 
eral times. Can you propose a method to find an unknown period? About how 
accurately can the period be estimated from this data? 
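
The folding step in part (b) can be sketched as follows. This is only an illustration under stated assumptions: the function name `fold` is hypothetical, and the times and magnitudes are synthetic (a sinusoid with the quoted period), not the contents of the file binary59683.

```python
import math

def fold(times, period):
    """Return the phase s = t mod P for each observation time."""
    return [t % period for t in times]

# Synthetic illustration (assumed data, not the Hipparcos measurements):
# a sinusoidal signal with the period quoted in part (b), sampled at
# irregular times.
P = 0.407528
times = [0.13 * k + 0.01 * (k % 7) for k in range(200)]
mags = [8.0 + 0.3 * math.sin(2 * math.pi * t / P) for t in times]

phases = fold(times, P)

# Sorting by phase gives the points in the order in which they would be
# drawn on a plot of magnitude versus s.
folded = sorted(zip(phases, mags))
```

Folding with a period that is slightly wrong makes observations from different cycles drift out of alignment, which is one way to think about part (d).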


These data come from the Hipparcos mission. More information, more lightcurves, 
and interactive demos can be found at 
http://www.rssd.esa.int/SA-general/Projects/Hipparcos/education.html. 


The following table shows the monthly returns of stock in Disney, MacDonalds, 
Schlumberger, and Haliburton for January through May 1998. Fit a multiple re- 
gression to predict Disney returns from those of the other stocks. What is the 
standard deviation of the residuals? What is $R^2$? 








Disney    MacDonalds  Schlumberger  Haliburton 
 0.08088  -0.01309    -0.08463      -0.13373 
 0.04737   0.15958     0.02884       0.03616 
-0.04634   0.09966     0.00165       0.07919 
 0.16834   0.03125     0.09571       0.09227 
-0.09082   0.06206    -0.05723      -0.13242 





Next, using the regression equation you have just found, carry out the pre- 
dictions for January through May of 1999 and compare to the actual data listed 
below. What is the standard deviation of the prediction error? How can the com- 
parison with the results from 1998 be explained? Is a reasonable explanation that 
the fundamental nature of the relationships changed in the one year period? 








Disney    MacDonalds  Schlumberger  Haliburton 
 0.1       0.02604     0.02695       0.00211 
 0.06629   0.07851     0.02362      -0.04 
-0.11545   0.06732     0.23938       0.35526 
 0.02008  -0.06483     0.06127       0.10714 
-0.08268  -0.09029    -0.05773      -0.02933 





The file bodytemp contains normal body temperature readings (degrees 
Fahrenheit) and heart rates (beats per minute) of 65 males (coded by 1) and 
65 females (coded by 2) from Shoemaker (1996). 


a. For both males and females, make scatterplots of heart rate versus body tem- 
perature. Comment on the relationship or lack thereof. 
b. Does the relationship for males appear to be the same as that for females? Ex- 
amine this question graphically, by making a scatterplot showing both females 
and males and identifying females and males by different plotting symbols. 

c. For the males, fit a linear regression to predict heart rate from temperature. Plot 
the residuals versus temperature and comment on whether the relationship is 
linear. Find the estimated slope and its standard error. 

d. Repeat the above for females. 

e. Test whether the slopes for males and females are equal. (Hint: Consider the 
difference of the slopes.) 

f. Test whether the intercepts are equal. 
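
The hint in part (e) can be sketched in code. The following is a minimal illustration, assuming that the two regressions are fit to independent samples (so that the variance of the difference of the slopes is the sum of the two squared standard errors) and that a normal approximation is adequate for samples of this size. The function name and the numerical inputs are hypothetical and are not taken from the bodytemp file.

```python
import math
from statistics import NormalDist

def slope_difference_test(b1, se1, b2, se2):
    """Approximate two-sided test of H0: the two slopes are equal.

    Assumes independent fits, so Var(b1 - b2) = se1^2 + se2^2, and uses
    a standard normal reference distribution.
    """
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical slopes and standard errors, for illustration only.
z, p_value = slope_difference_test(1.6, 1.0, 3.1, 1.2)
```

The same construction applies to the intercepts in part (f), with the estimated intercepts and their standard errors in place of the slopes.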


Old Faithful geyser in Yellowstone National Park, Wyoming, derives its name 
from the regularity of its eruptions. The file oldfaithful contains measure- 
ments on eight successive days of the durations of the eruptions (in minutes) 
and the subsequent time interval before the next eruption. The park posts 
predicted eruption times for visitors. How well can the time until the next eruption 
be predicted by the duration of the current one? 


a. Does the use of linear regression appear to be appropriate? 

b. If the duration is 2 minutes, what would you predict the time until the next 
eruption to be? How can you quantify the accuracy of the prediction? Repeat 
this analysis for a duration of 4.5 minutes. 


In 1970, Congress instituted a lottery for the military draft to support the unpopu- 
lar war in Vietnam. All 366 possible birth dates were placed in plastic capsules 
in a rotating drum and were selected one by one. Eligible males born on the first 
day drawn were first in line to be drafted, etc. The results were criticized by some 
who claimed that government incompetency at running a fair lottery resulted in 
a tendency of men born later in the year being more likely to be drafted. Indeed, 
later investigation revealed that the birth dates were placed in the drum by month 
and were not thoroughly mixed. The columns of the file 1970lottery are 
month, month number, day of the year, and draft number. 


a. Plot draft number versus day number. Do you see any trend? 
b. Plot the linear regression line on the scatterplot. 
c. Plot a local linear smoothing on the scatterplot. Try varying the bandwidth. 
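
A local linear smoother of the kind asked for in part (c) can be sketched as follows. This is one possible version, assuming Gaussian kernel weights with bandwidth h; the kernel choice and the function name `local_linear` are assumptions made for illustration, and the data in the usage lines are synthetic rather than the draft lottery file.

```python
import math

def local_linear(x0, xs, ys, h):
    """Local linear estimate at x0: fit a weighted least squares line,
    with Gaussian kernel weights of bandwidth h centered at x0, and
    evaluate the fitted line at x0."""
    w = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    sw = sum(w)  # can underflow to zero if h is tiny relative to the data
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return ybar + slope * (x0 - xbar)

# Synthetic example: an exactly linear relationship is reproduced
# whatever the bandwidth.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
fit_small_h = local_linear(4.5, xs, ys, h=1.0)
fit_large_h = local_linear(4.5, xs, ys, h=50.0)
```

Evaluating the function on a grid of x0 values traces out the smooth. A small h follows the data closely and tends to be wiggly, while a very large h approaches the ordinary least squares line.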


When gasoline is pumped into the tank of an automobile, hydrocarbon vapors in 
the tank are forced out and into the atmosphere, producing a significant amount 
of air pollution. For this reason, vapor-recovery devices are often installed on 
gasoline pumps. It is difficult to test a recovery device in actual operation, be- 
cause all that can be measured is the amount of vapor actually recovered and, by 
means of a “sniffer,” whether any vapor escaped into the atmosphere. To estimate 
the efficiency of the device, it is thus necessary to estimate the total amount of 
vapor in the tank by using its relation to the values of variables that can actually 
be measured. In this exercise, you will try to develop such a predictive 
relationship using data that were obtained in a laboratory experiment. The file 
gasvapor contains recordings of the following variables: initial tank temperature (°F), 
temperature of the dispensed gasoline (°F), initial vapor pressure in the tank (psi), 
vapor pressure of the dispensed gasoline (psi), and emitted hydrocarbons (g). A 
prediction of emitted hydrocarbons is desired. 

First, randomly select 40 observations and set them aside. You will develop 
a predictive relationship based on the remaining observations and then test its 
strength on the observations you have held out. (It is instructive to have each 
student in the class hold out the same 40 observations and then compare results.) 


a. Look at the relationships among the variables by scatterplots. Comment on 
which relationships look strong. Based on this information, what variables 
would you conjecture will be important in the model? Do the plots suggest 
that transformations will be helpful? Do there appear to be any outliers? 

b. Try fitting a few different models and select two that you think are the best. 

c. Using these two models, predict the responses for the 40 observations you 
have held out and compare the predictions to the observed values by plotting 
predicted versus observed values, and by plotting prediction errors versus each 
of the independent variables. Summarize the strength of the prediction by the 
root mean square prediction error: 





$$\text{RMSPE} = \sqrt{\frac{1}{40}\sum_{i=1}^{40}\left(Y_i - \hat{Y}_i\right)^2}$$

where $Y_i$ is the ith observed value and $\hat{Y}_i$ is the predicted value. 
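
The root mean square prediction error can be computed directly from its definition. The sketch below is a minimal version; the function name `rmspe` is hypothetical, and the function works for any number of held-out observations, with 40 being the case used in this exercise.

```python
import math

def rmspe(observed, predicted):
    """Root mean square prediction error over the held-out observations."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n = len(observed)
    squared_errors = ((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    return math.sqrt(sum(squared_errors) / n)
```

Because the held-out observations play no part in fitting, this quantity is typically somewhat larger than the residual standard deviation computed from the observations used to fit the model.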


Recordings of the levels of pollutants and various meteorological conditions are 
made hourly at several stations by the Los Angeles Pollution Control District. 
This agency attempts to construct mathematical/statistical models to predict pol- 
lution levels and to gain a better understanding of the complexities of air pollution. 
Obviously, very large quantities of data are collected and analyzed, but only a 
small set of data will be considered in this problem. The file airpollution 
contains the maximum level of an oxidant (a photochemical pollutant) and the 
morning averages of four meteorological variables: wind speed, temperature, hu- 
midity, and insolation (a measure of the amount of sunlight). The data cover 30 
days during one summer. 


a. Examine the relationship of oxidant level to each of the four meteorological 
variables and the relationships of the meteorological variables to each other. 
How well can the maximum level of oxidant be predicted from some or all of 
the meteorological variables? Which appear to be most important? 

b. The standard statistical model used in this chapter assumes that the errors are 
random and independent of one another. In data that are collected over time, 
the error at any given time may well be correlated with the error from the pre- 
ceding time. This phenomenon is called serial correlation, and in its presence 
the estimated standard errors of the coefficients developed in this chapter may 
be incorrect. The parameter estimates are still unbiased, however. (Why?) Can 
you detect serial correlation in the errors from your fits? 
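
One simple numerical check for serial correlation is the lag-1 sample autocorrelation of the residuals, taken in time order. The sketch below is one possible version; the function name is hypothetical, and the residual sequences in the usage lines are artificial and serve only to show the two extremes.

```python
def lag1_autocorrelation(residuals):
    """Lag-1 sample autocorrelation of residuals listed in time order."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    numerator = sum(dev[i] * dev[i - 1] for i in range(1, n))
    denominator = sum(d * d for d in dev)
    return numerator / denominator

# Artificial sequences: alternating signs give a strongly negative value,
# a steady drift gives a strongly positive one.
alternating = lag1_autocorrelation([1, -1] * 5)
drifting = lag1_autocorrelation(list(range(1, 11)))
```

A value near zero is consistent with independent errors. A plot of each residual against the preceding one conveys the same information graphically.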
Common Distributions 


Discrete Distributions 

Poisson 

Continuous Distributions 

Normal 
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}, \quad -\infty < x < \infty$ 
$E(X) = \mu$ 
$\mathrm{Var}(X) = \sigma^2$ 
$M(t) = e^{\mu t} e^{\sigma^2 t^2/2}$ 

Gamma 
$f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}, \quad x > 0$ 
$E(X) = \frac{\alpha}{\lambda}$ 
$\mathrm{Var}(X) = \frac{\alpha}{\lambda^2}$ 
$M(t) = \left(\frac{\lambda}{\lambda - t}\right)^{\alpha}, \quad t < \lambda$ 

Exponential (Special Case of Gamma with $\alpha = 1$) 

Chi-Square with n Degrees of Freedom (Special Case of Gamma with $\alpha = n/2$, $\lambda = \frac{1}{2}$) 

Uniform 
$f(x) = 1, \quad 0 \le x \le 1$ 
$E(X) = \frac{1}{2}$ 
$\mathrm{Var}(X) = \frac{1}{12}$ 
$M(t) = \frac{e^t - 1}{t}$ 

Beta 
$f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1-x)^{b-1}, \quad 0 \le x \le 1$ 
$E(X) = \frac{a}{a+b}$ 
$\mathrm{Var}(X) = \frac{ab}{(a+b)^2(a+b+1)}$ 
M(t) is not useful. 
TABLE 1 Binomial Probabilities 
Tabulated values are $P(X \le k) = \sum_{x=0}^{k} p(x)$. (Computations are rounded off at the third decimal place.) 
n = 5 
p 
k .01 .05 .10 .20 .30 .40 .50 .60 .70 .80 .90 .95 .99 
0 .951 .774 .590 .328 .168 .078 .031 .010 .002 .000 .000 .000 .000 
1 .999 .977 .919 .737 .528 .337 .188 .087 .031 .007 .000 .000 .000 
2 1.000 .999 .991 .942 .837 .683 .500 .317 .163 .058 .009 .001 .000 
3 1.000 1.000 1.000 .993 .969 .913 .812 .663 .472 .263 .081 .023 .001 
4 1.000 1.000 1.000 1.000 .998 .990 .969 .922 .832 .672 .410 .226 .049 
n = 10 
p 
k .01 .05 .10 .20 .30 .40 .50 .60 .70 .80 .90 .95 .99 
0 .904 .599 .349 .107 .028 .006 .001 .000 .000 .000 .000 .000 .000 
1 .996 .914 .736 .376 .149 .046 .011 .002 .000 .000 .000 .000 .000 
2 1.000 .988 .930 .678 .383 .167 .055 .012 .002 .000 .000 .000 .000 
3 1.000 .999 .987 .879 .650 .382 .172 .055 .011 .001 .000 .000 .000 
4 1.000 1.000 .998 .967 .850 .633 .377 .166 .047 .006 .000 .000 .000 
5 1.000 1.000 1.000 .994 .953 .834 .623 .367 .150 .033 .002 .000 .000 
6 1.000 1.000 1.000 .999 .989 .945 .828 .618 .350 .121 .013 .001 .000 
7 1.000 1.000 1.000 1.000 .998 .988 .945 .833 .617 .322 .070 .012 .000 
8 1.000 1.000 1.000 1.000 1.000 .998 .989 .954 .851 .624 .264 .086 .004 
9 1.000 1.000 1.000 1.000 1.000 1.000 .999 .994 .972 .893 .651 .401 .096 
n = 15 
p 
k .01 .05 .10 .20 .30 .40 .50 .60 .70 .80 .90 .95 .99 
0 .860 .463 .206 .035 .005 .000 .000 .000 .000 .000 .000 .000 .000 
1 .990 .829 .549 .167 .035 .005 .000 .000 .000 .000 .000 .000 .000 
2 1.000 .964 .816 .398 .127 .027 .004 .000 .000 .000 .000 .000 .000 
3 1.000 .995 .944 .648 .297 .091 .018 .002 .000 .000 .000 .000 .000 
4 1.000 .999 .987 .836 .515 .217 .059 .009 .001 .000 .000 .000 .000 
5 1.000 1.000 .998 .939 .722 .403 .151 .034 .004 .000 .000 .000 .000 
6 1.000 1.000 1.000 .982 .869 .610 .304 .095 .015 .001 .000 .000 .000 
7 1.000 1.000 1.000 .996 .950 .787 .500 .213 .050 .004 .000 .000 .000 
8 1.000 1.000 1.000 .999 .985 .905 .696 .390 .131 .018 .000 .000 .000 
9 1.000 1.000 1.000 1.000 .996 .966 .849 .597 .278 .061 .002 .000 .000 
10 1.000 1.000 1.000 1.000 .999 .991 .941 .783 .485 .164 .013 .001 .000 
11 1.000 1.000 1.000 1.000 1.000 .998 .982 .909 .703 .352 .056 .005 .000 
12 1.000 1.000 1.000 1.000 1.000 1.000 .996 .973 .873 .602 .184 .036 .000 
13 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .995 .965 .833 .451 .171 .010 
14 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .995 .965 .794 .537 .140 
n = 20 
p 
k .01 .05 .10 .20 .30 .40 .50 .60 .70 .80 .90 .95 .99 
0 .818 .358 .122 .012 .001 .000 .000 .000 .000 .000 .000 .000 .000 
1 .983 .736 .392 .069 .008 .001 .000 .000 .000 .000 .000 .000 .000 
2 .999 .925 .677 .206 .035 .004 .000 .000 .000 .000 .000 .000 .000 
3 1.000 .984 .867 .411 .107 .016 .001 .000 .000 .000 .000 .000 .000 
4 1.000 .997 .957 .630 .238 .051 .006 .000 .000 .000 .000 .000 .000 
5 1.000 1.000 .989 .804 .416 .126 .021 .002 .000 .000 .000 .000 .000 
6 1.000 1.000 .998 .913 .608 .250 .058 .006 .000 .000 .000 .000 .000 
7 1.000 1.000 1.000 .968 .772 .416 .132 .021 .001 .000 .000 .000 .000 
8 1.000 1.000 1.000 .990 .887 .596 .252 .057 .005 .000 .000 .000 .000 
9 1.000 1.000 1.000 .997 .952 .755 .412 .128 .017 .001 .000 .000 .000 
10 1.000 1.000 1.000 .999 .983 .872 .588 .245 .048 .003 .000 .000 .000 
11 1.000 1.000 1.000 1.000 .995 .943 .748 .404 .113 .010 .000 .000 .000 
12 1.000 1.000 1.000 1.000 .999 .979 .868 .584 .228 .032 .000 .000 .000 
13 1.000 1.000 1.000 1.000 1.000 .994 .942 .750 .392 .087 .002 .000 .000 
14 1.000 1.000 1.000 1.000 1.000 .998 .979 .874 .584 .196 .011 .000 .000 
15 1.000 1.000 1.000 1.000 1.000 1.000 .994 .949 .762 .370 .043 .003 .000 
16 1.000 1.000 1.000 1.000 1.000 1.000 .999 .984 .893 .589 .133 .016 .000 
17 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .996 .965 .794 .323 .075 .001 
18 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .999 .992 .931 .608 .264 .017 
19 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .999 .988 .878 .642 .182 
n = 25 
p 
k .01 .05 .10 .20 .30 .40 .50 .60 .70 .80 .90 .95 .99 
0 .778 .277 .072 .004 .000 .000 .000 .000 .000 .000 .000 .000 .000 
1 .974 .642 .271 .027 .002 .000 .000 .000 .000 .000 .000 .000 .000 
2 .998 .873 .537 .098 .009 .000 .000 .000 .000 .000 .000 .000 .000 
3 1.000 .966 .764 .234 .033 .002 .000 .000 .000 .000 .000 .000 .000 
4 1.000 .993 .902 .421 .090 .009 .000 .000 .000 .000 .000 .000 .000 
5 1.000 .999 .967 .617 .193 .029 .002 .000 .000 .000 .000 .000 .000 
6 1.000 1.000 .991 .780 .341 .074 .007 .000 .000 .000 .000 .000 .000 
7 1.000 1.000 .998 .891 .512 .154 .022 .001 .000 .000 .000 .000 .000 
8 1.000 1.000 1.000 .953 .677 .274 .054 .004 .000 .000 .000 .000 .000 
9 1.000 1.000 1.000 .983 .811 .425 .115 .013 .000 .000 .000 .000 .000 
10 1.000 1.000 1.000 .994 .902 .586 .212 .034 .002 .000 .000 .000 .000 
11 1.000 1.000 1.000 .998 .956 .732 .345 .078 .006 .000 .000 .000 .000 
12 1.000 1.000 1.000 1.000 .983 .846 .500 .154 .017 .000 .000 .000 .000 
13 1.000 1.000 1.000 1.000 .994 .922 .655 .268 .044 .002 .000 .000 .000 
14 1.000 1.000 1.000 1.000 .998 .966 .788 .414 .098 .006 .000 .000 .000 
15 1.000 1.000 1.000 1.000 1.000 .987 .885 .575 .189 .017 .000 .000 .000 
16 1.000 1.000 1.000 1.000 1.000 .996 .946 .726 .323 .047 .000 .000 .000 
17 1.000 1.000 1.000 1.000 1.000 .999 .978 .846 .488 .109 .002 .000 .000 
18 1.000 1.000 1.000 1.000 1.000 1.000 .993 .926 .659 .220 .009 .000 .000 
19 1.000 1.000 1.000 1.000 1.000 1.000 .998 .971 .807 .383 .033 .001 .000 
20 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .991 .910 .579 .098 .007 .000 
21 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .998 .967 .766 .236 .034 .000 
22 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .991 .902 .463 .127 .002 
23 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .998 .973 .729 .358 .026 
24 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .996 .928 .723 .222 
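
The entries of Table 1 are cumulative binomial probabilities and can be reproduced by summing the binomial frequency function. The following is a minimal sketch using only the Python standard library; the function name `binom_cdf` is hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X binomial with n trials and success probability p."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

# Values rounded to three decimals, as in the table.
entry_n5 = round(binom_cdf(2, 5, 0.50), 3)    # n = 5,  k = 2, p = .50
entry_n10 = round(binom_cdf(3, 10, 0.30), 3)  # n = 10, k = 3, p = .30
```

The rows for k = n are omitted from the table because the cumulative probability there is always 1.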
TABLE 2 Cumulative Normal Distribution: Values of P Corresponding 
to $z_P$ for the Normal Curve 

z is the standard normal variable. The value of P for $-z_P$ equals 1 minus the 
value of P for $+z_P$; for example, the P for −1.62 equals 1 − .9474 = .0526. 

$z_P$ | .00 | .01 | .02 | .03 | .04 | .05 | .06 | .07 | .08 | .09 
.0 | .5000 | .5040 | .5080 | .5120 | .5160 | .5199 | .5239 | .5279 | .5319 | .5359 
.1 | .5398 | .5438 | .5478 | .5517 | .5557 | .5596 | .5636 | .5675 | .5714 | .5753 
.2 | .5793 | .5832 | .5871 | .5910 | .5948 | .5987 | .6026 | .6064 | .6103 | .6141 
.3 | .6179 | .6217 | .6255 | .6293 | .6331 | .6368 | .6406 | .6443 | .6480 | .6517 
.4 | .6554 | .6591 | .6628 | .6664 | .6700 | .6736 | .6772 | .6808 | .6844 | .6879 
.5 | .6915 | .6950 | .6985 | .7019 | .7054 | .7088 | .7123 | .7157 | .7190 | .7224 
.6 | .7257 | .7291 | .7324 | .7357 | .7389 | .7422 | .7454 | .7486 | .7517 | .7549 
.7 | .7580 | .7611 | .7642 | .7673 | .7704 | .7734 | .7764 | .7794 | .7823 | .7852 
.8 | .7881 | .7910 | .7939 | .7967 | .7995 | .8023 | .8051 | .8078 | .8106 | .8133 
.9 | .8159 | .8186 | .8212 | .8238 | .8264 | .8289 | .8315 | .8340 | .8365 | .8389 
1.0 | .8413 | .8438 | .8461 | .8485 | .8508 | .8531 | .8554 | .8577 | .8599 | .8621 
1.1 | .8643 | .8665 | .8686 | .8708 | .8729 | .8749 | .8770 | .8790 | .8810 | .8830 
1.2 | .8849 | .8869 | .8888 | .8907 | .8925 | .8944 | .8962 | .8980 | .8997 | .9015 
1.3 | .9032 | .9049 | .9066 | .9082 | .9099 | .9115 | .9131 | .9147 | .9162 | .9177 
1.4 | .9192 | .9207 | .9222 | .9236 | .9251 | .9265 | .9279 | .9292 | .9306 | .9319 
1.5 | .9332 | .9345 | .9357 | .9370 | .9382 | .9394 | .9406 | .9418 | .9429 | .9441 
1.6 | .9452 | .9463 | .9474 | .9484 | .9495 | .9505 | .9515 | .9525 | .9535 | .9545 
1.7 | .9554 | .9564 | .9573 | .9582 | .9591 | .9599 | .9608 | .9616 | .9625 | .9633 
1.8 | .9641 | .9649 | .9656 | .9664 | .9671 | .9678 | .9686 | .9693 | .9699 | .9706 
1.9 | .9713 | .9719 | .9726 | .9732 | .9738 | .9744 | .9750 | .9756 | .9761 | .9767 
2.0 | .9772 | .9778 | .9783 | .9788 | .9793 | .9798 | .9803 | .9808 | .9812 | .9817 
2.1 | .9821 | .9826 | .9830 | .9834 | .9838 | .9842 | .9846 | .9850 | .9854 | .9857 
2.2 | .9861 | .9864 | .9868 | .9871 | .9875 | .9878 | .9881 | .9884 | .9887 | .9890 
2.3 | .9893 | .9896 | .9898 | .9901 | .9904 | .9906 | .9909 | .9911 | .9913 | .9916 
2.4 | .9918 | .9920 | .9922 | .9925 | .9927 | .9929 | .9931 | .9932 | .9934 | .9936 
2.5 | .9938 | .9940 | .9941 | .9943 | .9945 | .9946 | .9948 | .9949 | .9951 | .9952 
2.6 | .9953 | .9955 | .9956 | .9957 | .9959 | .9960 | .9961 | .9962 | .9963 | .9964 
2.7 | .9965 | .9966 | .9967 | .9968 | .9969 | .9970 | .9971 | .9972 | .9973 | .9974 
2.8 | .9974 | .9975 | .9976 | .9977 | .9977 | .9978 | .9979 | .9979 | .9980 | .9981 
2.9 | .9981 | .9982 | .9982 | .9983 | .9984 | .9984 | .9985 | .9985 | .9986 | .9986 
3.0 | .9987 | .9987 | .9987 | .9988 | .9988 | .9989 | .9989 | .9989 | .9990 | .9990 
3.1 | .9990 | .9991 | .9991 | .9991 | .9992 | .9992 | .9992 | .9992 | .9993 | .9993 
3.2 | .9993 | .9993 | .9994 | .9994 | .9994 | .9994 | .9994 | .9995 | .9995 | .9995 
3.3 | .9995 | .9995 | .9995 | .9996 | .9996 | .9996 | .9996 | .9996 | .9996 | .9997 
3.4 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9998 
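
The symmetry rule for negative values of z can be checked numerically. The sketch below uses `NormalDist` from the Python standard library; the variable names are chosen only for illustration.

```python
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal cumulative distribution function

p_upper = round(phi(1.62), 4)    # table entry for z = 1.62
p_lower = round(phi(-1.62), 4)   # value obtained by the symmetry rule
```

Here `p_upper` agrees with the tabulated .9474 and `p_lower` with 1 − .9474 = .0526.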
TABLE 3 Percentiles of the $\chi^2$ Distribution: Values of $\chi^2_P$ Corresponding to P 

df  $\chi^2_{.005}$  $\chi^2_{.01}$  $\chi^2_{.025}$  $\chi^2_{.05}$  $\chi^2_{.10}$  $\chi^2_{.90}$  $\chi^2_{.95}$  $\chi^2_{.975}$  $\chi^2_{.99}$  $\chi^2_{.995}$ 

1 .000039 .00016 .00098 .0039 .0158 2.71 3.84 5.02 6.63 7.88 
2 .0100 .0201 .0506 .1026 .2107 4.61 5.99 7.38 9.21 10.60 
3 .0717 .115 .216 .352 .584 6.25 7.81 9.35 11.34 12.84 
4 .207 .297 .484 .711 1.064 7.78 9.49 11.14 13.28 14.86 
5 .412 .554 .831 1.15 1.61 9.24 11.07 12.83 15.09 16.75 
6 .676 .872 1.24 1.64 2.20 10.64 12.59 14.45 16.81 18.55 
7 .989 1.24 1.69 2.17 2.83 12.02 14.07 16.01 18.48 20.28 
8 1.34 1.65 2.18 2.73 3.49 13.36 15.51 17.53 20.09 21.96 
9 1.73 2.09 2.70 3.33 4.17 14.68 16.92 19.02 21.67 23.59 
10 2.16 2.56 3.25 3.94 4.87 15.99 18.31 20.48 23.21 25.19 
11 2.60 3.05 3.82 4.57 5.58 17.28 19.68 21.92 24.73 26.76 
12 3.07 3.57 4.40 5.23 6.30 18.55 21.03 23.34 26.22 28.30 
13 3.57 4.11 5.01 5.89 7.04 19.81 22.36 24.74 27.69 29.82 
14 4.07 4.66 5.63 6.57 7.79 21.06 23.68 26.12 29.14 31.32 
15 4.60 5.23 6.26 7.26 8.55 22.31 25.00 27.49 30.58 32.80 
16 5.14 5.81 6.91 7.96 9.31 23.54 26.30 28.85 32.00 34.27 
18 6.26 7.01 8.23 9.39 10.86 25.99 28.87 31.53 34.81 37.16 
20 7.43 8.26 9.59 10.85 12.44 28.41 31.41 34.17 37.57 40.00 
24 9.89 10.86 12.40 13.85 15.66 33.20 36.42 39.36 42.98 45.56 
30 13.79 14.95 16.79 18.49 20.60 40.26 43.77 46.98 50.89 53.67 
40 20.71 22.16 24.43 26.51 29.05 51.81 55.76 59.34 63.69 66.77 
60 35.53 37.48 40.48 43.19 46.46 74.40 79.08 83.30 88.38 91.95 
120 83.85 86.92 91.58 95.70 100.62 140.23 146.57 152.21 158.95 163.64 


For large degrees of freedom, 

$$\chi^2_P = \tfrac{1}{2}\left(z_P + \sqrt{2\nu - 1}\right)^2 \quad \text{approximately,}$$

where $\nu$ = degrees of freedom and $z_P$ is given in Table 2. 
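
The large-degrees-of-freedom approximation can be tried out numerically. In the sketch below, the function name `chi2_percentile_approx` is hypothetical, and the standard normal percentile is obtained from `NormalDist().inv_cdf` in the Python standard library rather than read from Table 2.

```python
import math
from statistics import NormalDist

def chi2_percentile_approx(p, df):
    """Approximate the P-th percentile of chi-square with df degrees of
    freedom by (1/2) * (z_P + sqrt(2*df - 1))**2."""
    z = NormalDist().inv_cdf(p)
    return 0.5 * (z + math.sqrt(2 * df - 1)) ** 2

# Compare with the tabulated 95th percentile for 120 degrees of freedom.
approx_120 = chi2_percentile_approx(0.95, 120)
table_120 = 146.57
```

For 120 degrees of freedom the approximation comes within about 0.3 of the tabulated value.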


TABLE 4 Percentiles of the t Distribution 

df t.60 t.70 t.80 t.90 t.95 t.975 t.99 t.995 
1 .325 .727 1.376 3.078 6.314 12.706 31.821 63.657 
2 .289 .617 1.061 1.886 2.920 4.303 6.965 9.925 
3 .277 .584 .978 1.638 2.353 3.182 4.541 5.841 
4 .271 .569 .941 1.533 2.132 2.776 3.747 4.604 
5 .267 .559 .920 1.476 2.015 2.571 3.365 4.032 
6 .265 .553 .906 1.440 1.943 2.447 3.143 3.707 
7 .263 .549 .896 1.415 1.895 2.365 2.998 3.499 
8 .262 .546 .889 1.397 1.860 2.306 2.896 3.355 
9 .261 .543 .883 1.383 1.833 2.262 2.821 3.250 
10 .260 .542 .879 1.372 1.812 2.228 2.764 3.169 
11 .260 .540 .876 1.363 1.796 2.201 2.718 3.106 
12 .259 .539 .873 1.356 1.782 2.179 2.681 3.055 
13 .259 .538 .870 1.350 1.771 2.160 2.650 3.012 
14 .258 .537 .868 1.345 1.761 2.145 2.624 2.977 
15 .258 .536 .866 1.341 1.753 2.131 2.602 2.947 
16 .258 .535 .865 1.337 1.746 2.120 2.583 2.921 
17 .257 .534 .863 1.333 1.740 2.110 2.567 2.898 
18 .257 .534 .862 1.330 1.734 2.101 2.552 2.878 
19 .257 .533 .861 1.328 1.729 2.093 2.539 2.861 
20 .257 .533 .860 1.325 1.725 2.086 2.528 2.845 
21 .257 .532 .859 1.323 1.721 2.080 2.518 2.831 
22 .256 .532 .858 1.321 1.717 2.074 2.508 2.819 
23 .256 .532 .858 1.319 1.714 2.069 2.500 2.807 
24 .256 .531 .857 1.318 1.711 2.064 2.492 2.797 
25 .256 .531 .856 1.316 1.708 2.060 2.485 2.787 
26 .256 .531 .856 1.315 1.706 2.056 2.479 2.779 
27 .256 .531 .855 1.314 1.703 2.052 2.473 2.771 
28 .256 .530 .855 1.313 1.701 2.048 2.467 2.763 
29 .256 .530 .854 1.311 1.699 2.045 2.462 2.756 
30 .256 .530 .854 1.310 1.697 2.042 2.457 2.750 
40 .255 .529 .851 1.303 1.684 2.021 2.423 2.704 
60 .254 .527 .848 1.296 1.671 2.000 2.390 2.660 
120 .254 .526 .845 1.289 1.658 1.980 2.358 2.617 
∞ .253 .524 .842 1.282 1.645 1.960 2.326 2.576 
TABLE 5 Percentiles of the F Distribution: F.90(n1, n2) 

n1 = degrees of freedom for numerator 
n2 = degrees of freedom for denominator 

n2 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞ 
1 39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 59.86 60.19 60.71 61.22 61.74 62.00 62.26 62.53 62.79 63.06 63.33 
2 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.38 9.39 9.41 9.42 9.44 9.45 9.46 9.47 9.47 9.48 9.49 
3 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.24 5.23 5.22 5.20 5.18 5.18 5.17 5.16 5.15 5.14 5.13 
4 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.94 3.92 3.90 3.87 3.84 3.83 3.82 3.80 3.79 3.78 3.76 
5 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.32 3.30 3.27 3.24 3.21 3.19 3.17 3.16 3.14 3.12 3.10 
6 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.96 2.94 2.90 2.87 2.84 2.82 2.80 2.78 2.76 2.74 2.72 
7 3.59 3.26 3.07 2.96 2.88 2.83 2.78 2.75 2.72 2.70 2.67 2.63 2.59 2.58 2.56 2.54 2.51 2.49 2.47 
8 3.46 3.11 2.92 2.81 2.73 2.67 2.62 2.59 2.56 2.54 2.50 2.46 2.42 2.40 2.38 2.36 2.34 2.32 2.29 
9 3.36 3.01 2.81 2.69 2.61 2.55 2.51 2.47 2.44 2.42 2.38 2.34 2.30 2.28 2.25 2.23 2.21 2.18 2.16 
10 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35 2.32 2.28 2.24 2.20 2.18 2.16 2.13 2.11 2.08 2.06 
11 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.30 2.27 2.25 2.21 2.17 2.12 2.10 2.08 2.05 2.03 2.00 1.97 
12 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21 2.19 2.15 2.10 2.06 2.04 2.01 1.99 1.96 1.93 1.90 
13 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16 2.14 2.10 2.05 2.01 1.98 1.96 1.93 1.90 1.88 1.85 
14 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12 2.10 2.05 2.01 1.96 1.94 1.91 1.89 1.86 1.83 1.80 
15 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09 2.06 2.02 1.97 1.92 1.90 1.87 1.85 1.82 1.79 1.76 
16 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06 2.03 1.99 1.94 1.89 1.87 1.84 1.81 1.78 1.75 1.72 
17 3.03 2.64 2.44 2.31 2.22 2.15 2.10 2.06 2.03 2.00 1.96 1.91 1.86 1.84 1.81 1.78 1.75 1.72 1.69 
18 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00 1.98 1.93 1.89 1.84 1.81 1.78 1.75 1.72 1.69 1.66 
19 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98 1.96 1.91 1.86 1.81 1.79 1.76 1.73 1.70 1.67 1.63 
20 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96 1.94 1.89 1.84 1.79 1.77 1.74 1.71 1.68 1.64 1.61 
21 2.96 2.57 2.36 2.23 2.14 2.08 2.02 1.98 1.95 1.92 1.87 1.83 1.78 1.75 1.72 1.69 1.66 1.62 1.59 
22 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.93 1.90 1.86 1.81 1.76 1.73 1.70 1.67 1.64 1.60 1.57 
23 2.94 2.55 2.34 2.21 2.11 2.05 1.99 1.95 1.92 1.89 1.84 1.80 1.74 1.72 1.69 1.66 1.62 1.59 1.55 
24 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.91 1.88 1.83 1.78 1.73 1.70 1.67 1.64 1.61 1.57 1.53 
25 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89 1.87 1.82 1.77 1.72 1.69 1.66 1.63 1.59 1.56 1.52 
26 2.91 2.52 2.31 2.17 2.08 2.01 1.96 1.92 1.88 1.86 1.81 1.76 1.71 1.68 1.65 1.61 1.58 1.54 1.50 
27 2.90 2.51 2.30 2.17 2.07 2.00 1.95 1.91 1.87 1.85 1.80 1.75 1.70 1.67 1.64 1.60 1.57 1.53 1.49 
28 2.89 2.50 2.29 2.16 2.06 2.00 1.94 1.90 1.87 1.84 1.79 1.74 1.69 1.66 1.63 1.59 1.56 1.52 1.48 
29 2.89 2.50 2.28 2.15 2.06 1.99 1.93 1.89 1.86 1.83 1.78 1.73 1.68 1.65 1.62 1.58 1.55 1.51 1.47 
30 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.85 1.82 1.77 1.72 1.67 1.64 1.61 1.57 1.54 1.50 1.46 
40 2.84 2.44 2.23 2.09 2.00 1.93 1.87 1.83 1.79 1.76 1.71 1.66 1.61 1.57 1.54 1.51 1.47 1.42 1.38 
60 2.79 2.39 2.18 2.04 1.95 1.87 1.82 1.77 1.74 1.71 1.66 1.60 1.54 1.51 1.48 1.44 1.40 1.35 1.29 
120 2.75 2.35 2.13 1.99 1.90 1.82 1.77 1.72 1.68 1.65 1.60 1.55 1.48 1.45 1.41 1.37 1.32 1.26 1.19 
∞ 2.71 2.30 2.08 1.94 1.85 1.77 1.72 1.67 1.63 1.60 1.55 1.49 1.42 1.38 1.34 1.30 1.24 1.17 1.00 








TABLE 5 Percentiles of the F Distribution: F.95(n1, n2) (Continued) 

n1 = degrees of freedom for numerator 
n2 = degrees of freedom for denominator 

n2 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞ 
1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5 241.9 243.9 245.9 248.0 249.1 250.1 251.1 252.2 253.3 254.3 
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.43 19.45 19.45 19.46 19.47 19.48 19.49 19.50 
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53 
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63 
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.36 
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67 
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23 
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93 
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71 
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54 
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40 
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30 
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21 
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13 
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07 
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01 
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96 
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92 
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88 
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84 
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81 
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78 
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76 
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73 
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71 
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99 1.95 1.90 1.85 1.80 1.75 1.69 
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97 1.93 1.88 1.84 1.79 1.73 1.67 
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.71 1.65 
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94 1.90 1.85 1.81 1.75 1.70 1.64 
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62 
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51 
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39 
120 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25 
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00 





TABLE 5 Percentiles of the F Distribution: F.975(n1, n2) (Continued) 

n1 = degrees of freedom for numerator 
n2 = degrees of freedom for denominator 

n2 24 30 40 60 120 
1 997.2 1001 1006 1010 1014 
2 39.46 39.46 39.47 39.48 39.49 
3 14.12 14.08 14.04 13.99 13.95 
4 8.51 8.46 8.41 8.36 8.31 
5 6.28 6.23 6.18 6.12 6.07 
6 5.12 5.07 5.01 4.96 4.90 
7 4.42 4.36 4.31 4.25 4.20 
8 3.95 3.89 3.84 3.78 3.73 
9 3.61 3.56 3.51 3.45 3.39 
10 3.37 3.31 3.26 3.20 3.14 
11 3.17 3.12 3.06 3.00 2.94 
12 3.02 2.96 2.91 2.85 2.79 
13 2.89 2.84 2.78 2.72 2.66 
14 2.79 2.73 2.67 2.61 2.55 
15 2.70 2.64 2.59 2.52 2.46 
16 2.63 2.57 2.51 2.45 2.38 
17 2.56 2.50 2.44 2.38 2.32 
18 2.50 2.44 2.38 2.32 2.26 
19 2.45 2.39 2.33 2.27 2.20 
20 2.41 2.35 2.29 2.22 2.16 
21 2.37 2.31 2.25 2.18 2.11 
22 2.33 2.27 2.21 2.14 2.08 
23 2.30 2.24 2.18 2.11 2.04 
24 2.27 2.21 2.15 2.08 2.01 
25 2.24 2.18 2.12 2.05 1.98 
26 2.22 2.16 2.09 2.03 1.95 
27 2.19 2.13 2.07 2.00 1.93 
28 2.17 2.11 2.05 1.98 1.91 
29 2.15 2.09 2.03 1.96 1.89 
30 2.14 2.07 2.01 1.94 1.87 
40 2.01 1.94 1.88 1.80 1.72 
60 1.88 1.82 1.74 1.67 1.58 
120 1.76 1.69 1.61 1.53 1.43 
∞ 1.64 1.57 1.48 1.39 1.27 
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TABLE 5 Percentiles of the F Distribution: F 99(n1, n2) (Continued) 


n1 = degrees of freedom for numerator
n2 = degrees of freedom for denominator

n2 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 4052 4999.5 5403 5625 5764 5859 5928 5982 6022 6056 6106 6157 6209 6235 6261 6287 6313 6339 6366
2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 99.42 99.43 99.45 99.46 99.47 99.47 99.48 99.49 99.50
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.87 26.69 26.60 26.50 26.41 26.32 26.22 26.13
4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 14.37 14.20 14.02 13.93 13.84 13.75 13.65 13.56 13.46
5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.72 9.55 9.47 9.38 9.29 9.20 9.11 9.02
6 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.56 7.40 7.31 7.23 7.14 7.06 6.97 6.88
7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74 5.65
8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52 5.36 5.28 5.20 5.12 5.03 4.95 4.86
9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40 4.31
10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4.00 3.91
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.25 4.10 4.02 3.94 3.86 3.78 3.69 3.60
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.01 3.86 3.78 3.70 3.62 3.54 3.45 3.36
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.82 3.66 3.59 3.51 3.43 3.34 3.25 3.17
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.80 3.66 3.51 3.43 3.35 3.27 3.18 3.09 3.00
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52 3.37 3.29 3.21 3.13 3.05 2.96 2.87
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.41 3.26 3.18 3.10 3.02 2.93 2.84 2.75
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.31 3.16 3.08 3.00 2.92 2.83 2.75 2.65
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.23 3.08 3.00 2.92 2.84 2.75 2.66 2.57
19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.15 3.00 2.92 2.84 2.76 2.67 2.58 2.49
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52 2.42
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.03 2.88 2.80 2.72 2.64 2.55 2.46 2.36
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 2.98 2.83 2.75 2.67 2.58 2.50 2.40 2.31
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.93 2.78 2.70 2.62 2.54 2.45 2.35 2.26
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31 2.21
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 2.99 2.85 2.70 2.62 2.54 2.45 2.36 2.27 2.17
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 2.96 2.81 2.66 2.58 2.50 2.42 2.33 2.23 2.13
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 2.93 2.78 2.63 2.55 2.47 2.38 2.29 2.20 2.10
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.90 2.75 2.60 2.52 2.44 2.35 2.26 2.17 2.06
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 2.87 2.73 2.57 2.49 2.41 2.33 2.23 2.14 2.03
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11 2.01
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92 1.80
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73 1.60
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53 1.38
∞ 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32 2.18 2.04 1.88 1.79 1.70 1.59 1.47 1.32 1.00


A14 . AppendixB Tables 


TABLE 6 Percentiles of the Studentized Range: q.90


q = w/s, where w is the range of t observations, and v is the number of degrees of
freedom associated with the standard deviation s.
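
The definition q = w/s can be sketched in a few lines of Python. This is only an illustration: the function name `studentized_range` and the sample numbers are assumptions made for the example, not part of the table, and s is taken to be supplied by the caller together with its degrees of freedom v.

```python
def studentized_range(values, s):
    """Return q = w / s, where w is the range (max - min) of the t values."""
    if s <= 0:
        raise ValueError("the standard deviation s must be positive")
    w = max(values) - min(values)  # range of the t observations
    return w / s


# Hypothetical example: t = 4 observations and s = 2.0
q = studentized_range([10.0, 12.5, 9.0, 14.0], 2.0)
print(q)  # 2.5
```

The computed q would then be compared with the tabled percentile for the same t and v.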
t
v 2 3 4 5 6 7 8 9 10
1 8.93 13.44 16.36 18.49 20.15 21.51 22.64 23.62 24.48
2 4.13 5.73 6.77 7.54 8.14 8.63 9.05 9.41 9.72
3 3.33 4.47 5.20 5.74 6.16 6.51 6.81 7.06 7.29
4 3.01 3.98 4.59 5.03 5.39 5.68 5.93 6.14 6.33
5 2.85 3.72 4.26 4.66 4.98 5.24 5.46 5.65 5.82
6 2.75 3.56 4.07 4.44 4.73 4.97 5.17 5.34 5.50
7 2.68 3.45 3.93 4.28 4.55 4.78 4.97 5.14 5.28
8 2.63 3.37 3.83 4.17 4.43 4.65 4.83 4.99 5.13
9 2.59 3.32 3.76 4.08 4.34 4.54 4.72 4.87 5.01
10 2.56 3.27 3.70 4.02 4.26 4.47 4.64 4.78 4.91
11 2.54 3.23 3.66 3.96 4.20 4.40 4.57 4.71 4.84
12 2.52 3.20 3.62 3.92 4.16 4.35 4.51 4.65 4.78
13 2.50 3.18 3.59 3.88 4.12 4.30 4.46 4.60 4.72
14 2.49 3.16 3.56 3.85 4.08 4.27 4.42 4.56 4.68
15 2.48 3.14 3.54 3.83 4.05 4.23 4.39 4.52 4.64
16 2.47 3.12 3.52 3.80 4.03 4.21 4.36 4.49 4.61
17 2.46 3.11 3.50 3.78 4.00 4.18 4.33 4.46 4.58
18 2.45 3.10 3.49 3.77 3.98 4.16 4.31 4.44 4.55
19 2.45 3.09 3.47 3.75 3.97 4.14 4.29 4.42 4.53
20 2.44 3.08 3.46 3.74 3.95 4.12 4.27 4.40 4.51
24 2.42 3.05 3.42 3.69 3.90 4.07 4.21 4.34 4.44
30 2.40 3.02 3.39 3.65 3.85 4.02 4.16 4.28 4.38
40 2.38 2.99 3.35 3.60 3.80 3.96 4.10 4.21 4.32
60 2.36 2.96 3.31 3.56 3.75 3.91 4.04 4.16 4.25
120 2.34 2.93 3.28 3.52 3.71 3.86 3.99 4.10 4.19
∞ 2.33 2.90 3.24 3.48 3.66 3.81 3.93 4.04 4.13





























TABLE 6 Percentiles of the Studentized Range: q.90 (Continued)


t
v 11 12 13 14 15 16 17 18 19 20
1 25.24 25.92 26.54 27.10 27.62 28.10 28.54 28.96 29.35 29.71
2 10.01 10.26 10.49 10.70 10.89 11.07 11.24 11.39 11.54 11.68
3 7.49 7.67 7.83 7.98 8.12 8.25 8.37 8.48 8.58 8.68
4 6.49 6.65 6.78 6.91 7.02 7.13 7.23 7.33 7.41 7.50
5 5.97 6.10 6.22 6.34 6.44 6.54 6.63 6.71 6.79 6.86
6 5.64 5.76 5.87 5.98 6.07 6.16 6.25 6.32 6.40 6.47
7 5.41 5.53 5.64 5.74 5.83 5.91 5.99 6.06 6.13 6.19
8 5.25 5.36 5.46 5.56 5.64 5.72 5.80 5.87 5.93 6.00
9 5.13 5.23 5.33 5.42 5.51 5.58 5.66 5.72 5.79 5.85
10 5.03 5.13 5.23 5.32 5.40 5.47 5.54 5.61 5.67 5.73
11 4.95 5.05 5.15 5.23 5.31 5.38 5.45 5.51 5.57 5.63
12 4.89 4.99 5.08 5.16 5.24 5.31 5.37 5.44 5.49 5.55
13 4.83 4.93 5.02 5.10 5.18 5.25 5.31 5.37 5.43 5.48
14 4.79 4.88 4.97 5.05 5.12 5.19 5.26 5.32 5.37 5.43
15 4.75 4.84 4.93 5.01 5.08 5.15 5.21 5.27 5.32 5.38
16 4.71 4.81 4.89 4.97 5.04 5.11 5.17 5.23 5.28 5.33
17 4.68 4.77 4.86 4.93 5.01 5.07 5.13 5.19 5.24 5.30
18 4.65 4.75 4.83 4.90 4.98 5.04 5.10 5.16 5.21 5.26
19 4.63 4.72 4.80 4.88 4.95 5.01 5.07 5.13 5.18 5.23
20 4.61 4.70 4.78 4.85 4.92 4.99 5.05 5.10 5.16 5.20
24 4.54 4.63 4.71 4.78 4.85 4.91 4.97 5.02 5.07 5.12
30 4.47 4.56 4.64 4.71 4.77 4.83 4.89 4.94 4.99 5.03
40 4.41 4.49 4.56 4.63 4.69 4.75 4.81 4.86 4.90 4.95
60 4.34 4.42 4.49 4.56 4.62 4.67 4.73 4.78 4.82 4.86
120 4.28 4.35 4.42 4.48 4.54 4.60 4.65 4.69 4.74 4.78
∞ 4.21 4.28 4.35 4.41 4.47 4.52 4.57 4.61 4.65 4.69




TABLE 6 Percentiles of the Studentized Range: q.95 (Continued)


t
v 2 3 4 5 6 7 8 9 10
1 17.97 26.98 32.82 37.08 40.41 43.12 45.40 47.36 49.07
2 6.08 8.33 9.80 10.88 11.74 12.44 13.03 13.54 13.99
3 4.50 5.91 6.82 7.50 8.04 8.48 8.85 9.18 9.46
4 3.93 5.04 5.76 6.29 6.71 7.05 7.35 7.60 7.83
5 3.64 4.60 5.22 5.67 6.03 6.33 6.58 6.80 6.99
6 3.46 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49
7 3.34 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16
8 3.26 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92
9 3.20 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74
10 3.15 3.88 4.33 4.65 4.91 5.12 5.30 5.46 5.60
11 3.11 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49
12 3.08 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39
13 3.06 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32
14 3.03 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25
15 3.01 3.67 4.08 4.37 4.59 4.78 4.94 5.08 5.20
16 3.00 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15
17 2.98 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11
18 2.97 3.61 4.00 4.28 4.49 4.67 4.82 4.96 5.07
19 2.96 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04
20 2.95 3.58 3.96 4.23 4.45 4.62 4.77 4.90 5.01
24 2.92 3.53 3.90 4.17 4.37 4.54 4.68 4.81 4.92
30 2.89 3.49 3.85 4.10 4.30 4.46 4.60 4.72 4.82
40 2.86 3.44 3.79 4.04 4.23 4.39 4.52 4.63 4.73
60 2.83 3.40 3.74 3.98 4.16 4.31 4.44 4.55 4.65
120 2.80 3.36 3.68 3.92 4.10 4.24 4.36 4.47 4.56
∞ 2.77 3.31 3.63 3.86 4.03 4.17 4.29 4.39 4.47


TABLE 6 Percentiles of the Studentized Range: q.95 (Continued) 
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t 





























v 11 12 13 14 15 16 17 18 19 20
1 50.59 51.96 53.20 54.33 55.36 56.32 57.22 58.04 58.83 59.56
2 14.39 14.75 15.08 15.38 15.65 15.91 16.14 16.37 16.57 16.77
3 9.72 9.95 10.15 10.35 10.52 10.69 10.84 10.98 11.11 11.24
4 8.03 8.21 8.37 8.52 8.66 8.79 8.91 9.03 9.13 9.23
5 7.17 7.32 7.47 7.60 7.72 7.83 7.93 8.03 8.12 8.21
6 6.65 6.79 6.92 7.03 7.14 7.24 7.34 7.43 7.51 7.59
7 6.30 6.43 6.55 6.66 6.76 6.85 6.94 7.02 7.10 7.17
8 6.05 6.18 6.29 6.39 6.48 6.57 6.65 6.73 6.80 6.87
9 5.87 5.98 6.09 6.19 6.28 6.36 6.44 6.51 6.58 6.64
10 5.72 5.83 5.93 6.03 6.11 6.19 6.27 6.34 6.40 6.47
11 5.61 5.71 5.81 5.90 5.98 6.06 6.13 6.20 6.27 6.33
12 5.51 5.61 5.71 5.80 5.88 5.95 6.02 6.09 6.15 6.21
13 5.43 5.53 5.63 5.71 5.79 5.86 5.93 5.99 6.05 6.11
14 5.36 5.46 5.55 5.64 5.71 5.79 5.85 5.91 5.97 6.03
15 5.31 5.40 5.49 5.57 5.65 5.72 5.78 5.85 5.90 5.96
16 5.26 5.35 5.44 5.52 5.59 5.66 5.73 5.79 5.84 5.90
17 5.21 5.31 5.39 5.47 5.54 5.61 5.67 5.73 5.79 5.84
18 5.17 5.27 5.35 5.43 5.50 5.57 5.63 5.69 5.74 5.79
19 5.14 5.23 5.31 5.39 5.46 5.53 5.59 5.65 5.70 5.75
20 5.11 5.20 5.28 5.36 5.43 5.49 5.55 5.61 5.66 5.71
24 5.01 5.10 5.18 5.25 5.32 5.38 5.44 5.49 5.55 5.59
30 4.92 5.00 5.08 5.15 5.21 5.27 5.33 5.38 5.43 5.47
40 4.82 4.90 4.98 5.04 5.11 5.16 5.22 5.27 5.31 5.36
60 4.73 4.81 4.88 4.94 5.00 5.06 5.11 5.15 5.20 5.24
120 4.64 4.71 4.78 4.84 4.90 4.95 5.00 5.04 5.09 5.13
∞ 4.55 4.62 4.68 4.74 4.80 4.85 4.89 4.93 4.97 5.01
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t 
v 2 3 4 5 6 7 8 9 10
1 90.03 135.0 164.3 185.6 202.2 215.8 227.2 237.0 245.6
2 14.04 19.02 22.29 24.72 26.63 28.20 29.53 30.68 31.69
3 8.26 10.62 12.17 13.33 14.24 15.00 15.64 16.20 16.69
4 6.51 8.12 9.17 9.96 10.58 11.10 11.55 11.93 12.27
5 5.70 6.98 7.80 8.42 8.91 9.32 9.67 9.97 10.24
6 5.24 6.33 7.03 7.56 7.97 8.32 8.61 8.87 9.10
7 4.95 5.92 6.54 7.01 7.37 7.68 7.94 8.17 8.37
8 4.75 5.64 6.20 6.62 6.96 7.24 7.47 7.68 7.86
9 4.60 5.43 5.96 6.35 6.66 6.91 7.13 7.33 7.49
10 4.48 5.27 5.77 6.14 6.43 6.67 6.87 7.05 7.21
11 4.39 5.15 5.62 5.97 6.25 6.48 6.67 6.84 6.99
12 4.32 5.05 5.50 5.84 6.10 6.32 6.51 6.67 6.81
13 4.26 4.96 5.40 5.73 5.98 6.19 6.37 6.53 6.67
14 4.21 4.89 5.32 5.63 5.88 6.08 6.26 6.41 6.54
15 4.17 4.84 5.25 5.56 5.80 5.99 6.16 6.31 6.44
16 4.13 4.79 5.19 5.49 5.72 5.92 6.08 6.22 6.35
17 4.10 4.74 5.14 5.43 5.66 5.85 6.01 6.15 6.27
18 4.07 4.70 5.09 5.38 5.60 5.79 5.94 6.08 6.20
19 4.05 4.67 5.05 5.33 5.55 5.73 5.89 6.02 6.14
20 4.02 4.64 5.02 5.29 5.51 5.69 5.84 5.97 6.09
24 3.96 4.55 4.91 5.17 5.37 5.54 5.69 5.81 5.92
30 3.89 4.45 4.80 5.05 5.24 5.40 5.54 5.65 5.76
40 3.82 4.37 4.70 4.93 5.11 5.26 5.39 5.50 5.60
60 3.76 4.28 4.59 4.82 4.99 5.13 5.25 5.36 5.45
120 3.70 4.20 4.50 4.71 4.87 5.01 5.12 5.21 5.30
∞ 3.64 4.12 4.40 4.60 4.76 4.88 4.99 5.08 5.16


TABLE 6 Percentiles of the Studentized Range: q.99 (Continued)


t
v 11 12 13 14 15 16 17 18 19 20
1 253.2 260.0 266.2 271.8 277.0 281.8 286.3 290.4 294.3 298.0
2 32.59 33.40 34.13 34.81 35.43 36.00 36.53 37.03 37.50 37.95
3 17.13 17.53 17.89 18.22 18.52 18.81 19.07 19.32 19.55 19.77
4 12.57 12.84 13.09 13.32 13.53 13.73 13.91 14.08 14.24 14.40
5 10.48 10.70 10.89 11.08 11.24 11.40 11.55 11.68 11.81 11.93
6 9.30 9.48 9.65 9.81 9.95 10.08 10.21 10.32 10.43 10.54
7 8.55 8.71 8.86 9.00 9.12 9.24 9.35 9.46 9.55 9.65
8 8.03 8.18 8.31 8.44 8.55 8.66 8.76 8.85 8.94 9.03
9 7.65 7.78 7.91 8.03 8.13 8.23 8.33 8.41 8.49 8.57
10 7.36 7.49 7.60 7.71 7.81 7.91 7.99 8.08 8.15 8.23
11 7.13 7.25 7.36 7.46 7.56 7.65 7.73 7.81 7.88 7.95
12 6.94 7.06 7.17 7.26 7.36 7.44 7.52 7.59 7.66 7.73
13 6.79 6.90 7.01 7.10 7.19 7.27 7.35 7.42 7.48 7.55
14 6.66 6.77 6.87 6.96 7.05 7.13 7.20 7.27 7.33 7.39
15 6.55 6.66 6.76 6.84 6.93 7.00 7.07 7.14 7.20 7.26
16 6.46 6.56 6.66 6.74 6.82 6.90 6.97 7.03 7.09 7.15
17 6.38 6.48 6.57 6.66 6.73 6.81 6.87 6.94 7.00 7.05
18 6.31 6.41 6.50 6.58 6.65 6.73 6.79 6.85 6.91 6.97
19 6.25 6.34 6.43 6.51 6.58 6.65 6.72 6.78 6.84 6.89
20 6.19 6.28 6.37 6.45 6.52 6.59 6.65 6.71 6.77 6.82
24 6.02 6.11 6.19 6.26 6.33 6.39 6.45 6.51 6.56 6.61
30 5.85 5.93 6.01 6.08 6.14 6.20 6.26 6.31 6.36 6.41
40 5.69 5.76 5.83 5.90 5.96 6.02 6.07 6.12 6.16 6.21
60 5.53 5.60 5.67 5.73 5.78 5.84 5.89 5.93 5.97 6.01
120 5.37 5.44 5.50 5.56 5.61 5.66 5.71 5.75 5.79 5.83
∞ 5.23 5.29 5.35 5.40 5.45 5.49 5.54 5.57 5.61 5.65
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TABLE7 Percentage Points of the Bonferroni t Statistic: t^" 


α = .05
























































5 | 4.78 | 5.25 | 5.60 | 5.89 | 6.15 | 6.36 | 6.56 | 6.70 | 6.86
7 | 4.03 | 4.36 | 4.59 | 4.78 | 4.95 | 5.09 | 5.21 | 5.31 | 5.40
10 | 3.58 | 3.83 | 4.01 | 4.15 | 4.27 | 4.37 | 4.45 | 4.53 | 4.59
12 | 3.43 | 3.65 | 3.80 | 3.93 | 4.04 | 4.13 | 4.20 | 4.26 | 4.32
15 | 3.29 | 3.48 | 3.62 | 3.74 | 3.82 | 3.90 | 3.97 | 4.02 | 4.07
20 | 3.16 | 3.33 | 3.46 | 3.55 | 3.63 | 3.70 | 3.76 | 3.80 | 3.85
24 | 3.09 | 3.26 | 3.38 | 3.47 | 3.54 | 3.61 | 3.66 | 3.70 | 3.74
30 | 3.03 | 3.19 | 3.30 | 3.39 | 3.46 | 3.52 | 3.57 | 3.61 | 3.65
40 | 2.97 | 3.12 | 3.23 | 3.31 | 3.38 | 3.43 | 3.48 | 3.51 | 3.55
60 | 2.92 | 3.06 | 3.16 | 3.24 | 3.30 | 3.34 | 3.39 | 3.42 | 3.46
120 | 2.86 | 2.99 | 3.09 | 3.15 | 3.22 | 3.27 | 3.31 | 3.34 | 3.37
∞ | 2.81 | 2.94 | 3.02 | 3.09 | 3.15 | 3.19 | 3.23 | 3.26 | 3.29





























7.51 
5.79 
4.86 
4.56 
4.29 


4.03 
3.91 
3.80 
3.70 
3:39 


3.50 
3.40 





8.00 
6.08 
5.06 
4.73 
4.42 


4.15 
4.04 
3.90 
3.79 
3.69 


3.58 
3.48 


8.37 
6.30 
5.20 
4.86 
4.53 


4.25 
4.1 

3.98 
3.88 
3.76 


3.64 
3.54 





8.68 
6.49 
5.33 
4.95 
4.61 


4.33 
4.2 

4.13 
3:93 
3.81 


3.69 
3.59 





8.95 
6.67 
5.44 
5.04 
4.71 


4.39 
4.3 

4.26 
3.97 
3.84 


3.73 
3.63 





9.19 
6.83 
5.52 
5.12 
4.78 


4.46 
4.3 
4.1 
4.01 
3.89 


3.77 
3.66 





9.41 
6.93 
5.60 
5.20 
4.84 


4.52 
4.3 
4.2 
4.1 
3.93 


3.80 
3.69 





9.68 
7.06 
5.70 
5.277 
4.90 


4.56 
4.4 
4.2 
4.1 
3:97 


3.83 
3.72 
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TABLE 8 Critical Values of Smaller Rank Sum for the Wilcoxon Mann-Whitney Test 







































































n2 | α for Two-Sided Test | α for One-Sided Test | n1 (Smaller Sample): 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
3 | .20 | .10 | - | 3 | 7
  | .10 | .05 | - | - | 6
  | .05 | .025 | - | - | -
  | .01 | .005 | - | - | -
4 | .20 | .10 | - | 3 | 7 | 13
  | .10 | .05 | - | - | 6 | 11
  | .05 | .025 | - | - | - | 10
  | .01 | .005 | - | - | - | -
5 | .20 | .10 | - | 4 | 8 | 14 | 20
  | .10 | .05 | - | 3 | 7 | 12 | 19
  | .05 | .025 | - | - | 6 | 11 | 17
  | .01 | .005 | - | - | - | - | 15
6 | .20 | .10 | - | 4 | 9 | 15 | 22 | 30
  | .10 | .05 | - | 3 | 8 | 13 | 20 | 28
  | .05 | .025 | - | - | 7 | 12 | 18 | 26
  | .01 | .005 | - | - | - | 10 | 16 | 23
7 | .20 | .10 | - | 4 | 10 | 16 | 23 | 32 | 41
  | .10 | .05 | - | 3 | 8 | 14 | 21 | 29 | 39
  | .05 | .025 | - | - | 7 | 13 | 20 | 27 | 36
  | .01 | .005 | - | - | - | 10 | 16 | 24 | 32
8 | .20 | .10 | - | 5 | 11 | 17 | 25 | 34 | 44 | 55
  | .10 | .05 | - | 4 | 9 | 15 | 23 | 31 | 41 | 51
  | .05 | .025 | - | 3 | 8 | 14 | 21 | 29 | 38 | 49
  | .01 | .005 | - | - | - | 11 | 17 | 25 | 34 | 43
9 | .20 | .10 | 1 | 5 | 11 | 19 | 27 | 36 | 46 | 58 | 70
  | .10 | .05 | - | 4 | *10 | 16 | 24 | 33 | 43 | 54 | 66
  | .05 | .025 | - | 3 | 8 | 14 | 22 | 31 | 40 | 51 | 62
  | .01 | .005 | - | - | 6 | 11 | 18 | 26 | 35 | 45 | 56


(continued) 
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TABLE 8 Critical Values of Smaller Rank Sum for the Wilcoxon Mann-Whitney Test (Continued) 







































































α for α for n1 (Smaller Sample)
Two-Sided | One-Sided
n2 | Test Test 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | 17 | 18 | 19 | 20
10 | .20 .10 1 6 | 12 20 | 28 | 38 | 49 | 60 | 73 87 
.10 .05 4 | 10 17 | 26 | 35 | 45 | 56 | 69 82 
.05 .025 3 9 15 | 23 | 32 | 42 | 53 | 65 78 
.01 .005 6 12 | 19 | 27 | 37 | 47 | 58 71 
11 | .20 .10 1 6 | 13 21 | 30 | 40 | 51 | 63 | 76 91 | 106 
.10 .05 4| 11 18 | 27 | 37 | 47 | 59 | 72 86 | 100 
.05 .025 3 9 16 | 24 | 34 | 44 | 55 | 68 81 96 
.01 .005 6 12 | 20 | 28 | 38 | 49 | 61 73 87 
12 | .20 .10 1 7 | 14 22 | 32 | 42 | 54 | 66 | 80 94 | 110 | 127 
.10 .05 5| 11 19 | 28 | 38 | 49 | 62 | 75 89 | 104 | 120 
.05 .025 4 | 10 17 | 26 | 35 | 46 | 58 | 71 84 99 | 115 
.01 .005 7 13 | 21 | 30 | 40 | 51 | 63 76 90 | 105 
13 | 20 .10 1 7 | 15 23 | 33 | 44 | 56 | 69 | 83 98 | 114 | 131 | 149 
.10 .05 5,12 20 | 30 | 40 | 52 | 64 | 78 92 | 108 | 125 | 142 
.05 .025 4 | 10 18 | 27 | 37 | 48 | 60 | 73 88 | 103 | 119 | 136 
.01 .005 7 | *13 | 22 | 31 | 41 | 53 | 65 79 93 | 109 | 125 
14 | .20 .10 1| *8 | 16 25 | 35 | 46 | 59 | 72 | 86 | 102 | 118 | 136 | 154 | 174 
.10 .05 *6 | 13 21 | 31 | 42 | 54 | 67 | 81 96 | 112 | 129 | 147 | 166 
.05 .025 4| 11 19 | 28 | 38 | 50 | 62 | 76 91 | 106 | 123 | 141 | 160 
.01 .005 7 14 | 22 | 32 | 43 | 54 | 67 81 96 | 112 | 129 | 147 
15 | .20 .10 1 8 | 16 26 | 37 | 48 | 61 | 75 | 90 | 106 | 123 | 141 | 159 | 179 | 200 
.10 .05 6 | 13 22 | 33 | 44 | 56 | 69 | 84 99 | 116 | 133 | 152 | 171 | 192 
.05 .025 4] 11 20 | 29 | 40 | 52 | 65 | 79 94 | 110 | 127 | 145 | 164 | 184 
.01 .005 8 15 | 23 33 | 44 | 56 | 69 84 99 | 115 | 133 | 151 | 171 
16 | 20 .10 1 8 | 17 27 | 38 | 50 | 64 | 78 | 93 | 109 | 127 | 145 | 165 | 185 | 206 | 229 
.10 .05 6 | 14 24 | 34 | 46 | 58 | 72 | 87 | 103 | 120 | 138 | 156 | 176 | 197 | 219 
.05 .025 4| 12 21 | 30 | 42 | 54 | 67 | 82 97 | 113 | 131 | 150 | 169 | 190 | 211 
.01 .005 8 15|24134 1 46 1 58 | 72 86 | 102 | 119 | 136 | 155 | 175 | 196 
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TABLE 8 Critical Values of Smaller Rank Sum for the Wilcoxon Mann-Whitney Test (Continued) 











α for α for n1 (Smaller Sample)
Two-Sided | One-Sided
n2 | Test Test 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 20
17 | .20 .10 1| 9 | 18 | 28 | 40 | 52 | 66 | 81 | 97| 113 | 131 | 150 | 170 | 190 | 212 | 235 | 259 
.10 .05 6 | 15 | 25 | 35 | 47 | 61 | 75 | 90 | 106 | 123 | 142 161 182 | 203 | 225 | 249 
.05 .025 5 | 12 | 21 | 32 | 43 | 56 | 70 | 84 | 100 | 117 | 135 154 | 174 | 195 | 217 | 240 
.01 .005 8 | 16 | 25 | 36 | 47 | 60) 74| 89 | 105 | 122 140 | 159 | 180 | 201 | 223 
18 | 20 .10 1| 9} 19 | 30 | 42 | 55 | 69 | 84 | 100 | 117 | 135 | 155 175 196 | 218 | 242 | 266 | 291 
.10 .05 7 | 15 | 26 | 37 | 49 | 63 | 77 | 93 | 110 | 127 | 146 | 166 | 187 | 208 | 231 | 255 | 280 
.05 .025 5 | 13 | 22 | 33 | 45 | 58 | 72 | 87 | 103 | 121 | 139 158 179 | 200 | 222 | 246 | 270 
.01 .005 8 | 16 | 26 | 37 | 49 | 62 | 76 | 92 | 108 | 125 144 | 163 | 184 | 206 | 228 | 252 
19 | 20 .10 2 | 10 | 20 | 31 | 43 | 57 | 71 | 87 | 103 | 121 | 139 | 159 180 | 202 | 224 | 248 | 273 | 299 | 325 
.10 .05 1| 7 | 16 | 27 | 38 | 51} 65] 80} 96) 113| 131 | 150 | 171 192 | 214 | 237 | 262 | 287 | 313 
.05 .025 5 | 13 | 23 | 34 | 46 | 60 | 74 | 90 | 107 | 124 | 143 163 | *183 | 205 | 228 | 252 | 277 | 303 
.01 .005 3 | 9 | 17 | 27 | 38 | 50] 64) 78 | 94 | 111 | 129 | *148 168 | 189 | 210 | 234 | 258 | 283 
20 | .20 .10 2 | 10 | 21 | 32 | 45 | 59 | 74 | 90 | 107 | 125 | 144 | 164 | 185 | 207 | 230 | 255 | 280 | 306 | 333 | 361 
.10 .05 1| 7 | 17 |} 28 | 40 | 53 | 67 |] 83 | 99 | 117 | 135 | 155 175 197 | 220 | 243 | 268 | 294 | 320 | 348 
.05 .025 5 | 14 | 24 | 35 | 48 | 62 | 77 | 93 | 110 | 128 | 147 167 188 | 210 | 234 | 258 | 283 | 309 | 337 
.01 .005 3: 9] 18 | 28 | 39 | 52 | 66] 81] 97 | 114 | 132 151 172 | 193 | 215 | 239 | 263 | 289 | 315 












































For larger values of n1 and n2, critical values are given to a good approximation by the formula:

\[ \frac{n_1(n_1 + n_2 + 1)}{2} - z\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} \]

where z = 1.28 for α = .20 (two-sided test)
z = 1.64 for α = .10 (two-sided test)
z = 1.96 for α = .05 (two-sided test)
z = 2.58 for α = .01 (two-sided test)
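
As a rough sketch of the large-sample approximation for the smaller rank sum, the following Python function evaluates mean minus z standard deviations, assuming the usual null mean n1(n1 + n2 + 1)/2 and null variance n1 n2 (n1 + n2 + 1)/12 of the rank sum. The function name `rank_sum_critical_value` is hypothetical and chosen only for this illustration.

```python
import math


def rank_sum_critical_value(n1, n2, z):
    """Normal approximation to the lower critical value of the rank sum
    of the smaller sample (size n1) when the other sample has size n2."""
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return mean - z * sd


# n1 = n2 = 20 at two-sided alpha = .05 (z = 1.96)
print(round(rank_sum_critical_value(20, 20, 1.96), 1))  # 337.5
```

For n1 = n2 = 20 the approximation gives about 337.5, which is close to the tabled value of 337, so the approximation appears to work well at the edge of the table.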
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* Values have been corrected to the values given by D. B. Owen, Handbook of Statistical Tables, copyright 1962, Addison-Wesley Publishing Co., Inc. 
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TABLE9 Critical Values of W, (n) for the Wilcoxon Signed-Ranks Test 


W_α is the integer such that the probability that W ≤ W_α is closest to α. For example,
for n = 8, P(W ≤ 3) = .020 and P(W ≤ 4) = .027; therefore, W_.025(8) = 4.
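
The probabilities quoted in the example can be checked by enumerating the null distribution of the signed-rank statistic. The sketch below assumes that, under the null hypothesis, each of the ranks 1, ..., n is included in W independently with probability 1/2, so P(W = w) is the number of subsets of {1, ..., n} with sum w divided by 2^n. The function name `signed_rank_cdf` is an illustrative choice, not notation from the table.

```python
def signed_rank_cdf(n):
    """Return a list cdf where cdf[w] = P(W <= w) under the null hypothesis."""
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)  # counts[w] = number of rank subsets summing to w
    counts[0] = 1
    for rank in range(1, n + 1):
        # go downward so each rank is used at most once
        for total in range(max_sum, rank - 1, -1):
            counts[total] += counts[total - rank]
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / 2 ** n)
    return cdf


cdf = signed_rank_cdf(8)
print(round(cdf[3], 3), round(cdf[4], 3))  # 0.02 0.027
```

For n = 8 this gives 5/256 and 7/256, and the second is the closer of the two to .025, in agreement with the tabled value of 4.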





α for One-Sided Test
.025 .01 .005
α for Two-Sided Test
n .05 .02 .01
6 0 — — 
7 2 0 — 
8 4 2 0 
9 6 3 2 
10 8 5 3 
11 11 7 5 
12 14 10 7 
13 17 13 10 
14 21 16 13 
15 25 20 16 
16 30 24 20 
17 35 28 23 
18 40 33 28 
19 46 38 32 
20 52 43 38 
21 59 49 43 
22 66 56 49 
23 73 62 55 
24 81 69 61 
25 89 77 68

For large n,

\[ W_p(n) = \frac{n(n + 1)}{4} - z_{1-p}\sqrt{\frac{n(n + 1)(2n + 1)}{24}} \]

approximately, where z is given in Table 2.
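
A minimal sketch of the large-n approximation, assuming it takes the form of the null mean n(n + 1)/4 minus z times the null standard deviation sqrt(n(n + 1)(2n + 1)/24). The function name `signed_rank_critical_value` is hypothetical, and z is assumed to be the upper-tail normal quantile (for example 1.96 for a one-sided α of .025).

```python
import math


def signed_rank_critical_value(n, z):
    """Normal approximation to the lower critical value of the signed-rank statistic."""
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return mean - z * sd


# n = 25 at one-sided alpha = .025 (z = 1.96)
print(round(signed_rank_critical_value(25, 1.96), 1))  # 89.7
```

For n = 25 the approximation gives about 89.7, close to the tabled value of 89.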
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Answers to Selected 
Problems 


Following are answers to those odd-numbered problems for which a short answer can 
be given. No proofs, graphs, or extensive data analysis are given. 


Chapter 1 


1. a. Ω = {hhh, hht, htt, hth, ttt, tth, thh, tht}
b. A = {hhh, hht, hth, thh}
B = {hht, hhh}
C = {hht, htt, ttt, tht}
c. A^c = {htt, ttt, tth, tht}
A ∩ B = {hht, hhh}
A ∪ C = {hhh, hht, hth, thh, htt, ttt, tht}


3. Q= (rrr, rrg, rrw, rwg, rgw, rgr, rwr, rgg, 88r, gU, grr, grw, 
gWr, erg, gwg, wrr, weg, wrg, wer} 


5. Q— (An B) n (AU B) 9. Not 50% 11. 7x 6x 5 x 4/104 
13. a. 10(4° — 4)/(Ẹ) b. 13 x 48/(%) c. 13x 12x4x6/(7) 
15. 72 19.a.5x3x2x92/(7) b. 240/(7) 

21. i 23. n(n — 1) 25. 6 


27. 26 x 25 x 24 x 23 x 22/26 29. [ou 3133 «dw 


33. 1x 6x 5x Ax 3/7 37. 210 

39. a. 211/26! b. 1.818 x 107 

41. a. [C) - 2) - OOT/C2) b. (3)/(4) 

4. (3 47. a. 11/45 b. 6/11 
49. a. 4/7 b. 3/7 51. 2/5 




53. 0.35 55. a. 48, .70 b. .064, .614, .322 
57. 2/3 59. a. 2/3 b. 5/6 
61. .86 63. 1/3 69. Yes 


73. S (5) plu1—p)  75.p?—2p?--1;.597 77. 14 
j-k 
79. a. P(aa) = 1/4, P(Aa) = 1/2, P(AA) = 1/4 
b. 2/3 
c. P(aa) = p/6, P(Aa) = 1/3 + p/6, P(AA) = 2/3 — p/3 
d. pe = [0 — p/4Q/3)/Q — p/6) 


Chapter 2 

3. p(1) = .1, p(2) = .2, p(3) = .4, p(4) = .1, p(5) = .2
0 x <0 

7. F(x) 2 4 1- p Oxx«l 9. p « .5 11. [n+ 1) p] 
1 x>1 

13. a. .0130 b. .2517 15. 3of 5 

17. P(X = k) = p(1 − p)^k, k = 0, 1, ...   19. F(n) = 1 − (1 − p)^n

23. ao i ) p'Q- py 25. a. .9987 b. 9 x 1077 


27. p(k) = 100*e-'? /k!. approximately — 29. P(X < 4) = .532104 
31. a. 28 b. 20.79min — 33. f(x) — afix-!exp(-ax^) 37. 2/3 


39. b. f(x) = [π(1 + x²)]^{-1}, −∞ < x < ∞ c. 3.08

41. —log(1/4)/A, —log(3/4)/A 43. f(x) = Aux x? exp(—AAzx?/3) 
45. a. 1— e! b.e?-—e15 c. 46.1 

53. a. 0.3085 b. 0.8351 Ce 21:5 

55. c = 1.960 59. f(x) 2 x V2 

61. (./cy*t*7! exp( —At/c)/ T (a) 63. [x (1 4 x2)]7! 





65. X = [—-1+4+ 2/1/4 — a (1/2 — a/4 — U)]/a, where U is uniform 
67. a. f(x) = (B/a?)xP! exp(—(x/a)*) 
69. f(x) = (4/3) 8/An) P x ?P? exp(—A (3x /4sr)!/3) 


Chapter 3 


1. a. py = .19, po = .32, p3 = .31, ps = .18, for both X and Y 
b. p(1]1) = .526, p(2|1) = .263, p(3]1) = .105, p(4|1) = .105, for both 
X and Y 




15. 


17. 


19. 
23. 
29. 
33. 
43. 
49. 
55. 
57. 


61. 


63. 


67. 


. Multinomial, n = 10, p1 = p2 = p3 = 1/3
. f_XY(x, y) = αβ exp[−αx − βy]; f_X(x) = α exp[−αx], f_Y(y) = β exp[−βy]
. a. f_X(x) = 3(1 − x²)/4, −1 ≤ x ≤ 1; f_Y(y) = 3√(1 − y)/2, 0 ≤ y ≤ 1


b. f_X|Y(x|y) = 1/(2√(1 − y)), f_Y|X(y|x) = 1/(1 − x²)


. 1/9 
13. 


p(0) = 1/2, p(1) = p) = 1/4 
" 2/21 
2/2 


d. f=- y), -1<y<1 
fxG)-2$0-x), -l<x<1 


X and Y are not independent. 


vV1-x?— y? 


m(1 — x2) 


V1-x?—y? 


nil y?) 


b. f_X(x) = 1 − |x|, −1 ≤ x ≤ 1; f_Y(y) = 1 − |y|, −1 ≤ y ≤ 1
c. f_X|Y(x|y) = 1/(2 − 2|y|), |y| − 1 ≤ x ≤ 1 − |y|
f_Y|X(y|x) = 1/(2 − 2|x|), |x| − 1 ≤ y ≤ 1 − |x|


a. B/(a + B) b. B/Qo + p) 

Binomial (m, pr) 

h(x, y) = Ape **e [1 + (1 — 2e ^)(1 — 2e7^)] 
a. few (0|n) = n(n + Dod 2 0)! 

f_S(s) = s for 0 ≤ s ≤ 1 and = 2 − s for 1 ≤ s ≤ 2


Xe? 5I. — Ag 53. 5/9 


a. c — 3/2n 


e. fyx (lx) = 


fxir(x ly) = 





fxr, y) = Q2 - y?) "^, x? Ey? <1 


XQp—ypDx—-—ycty 


1 u—a v—c 
fov.) = gf ( Bet wg ) 





u+v u-—v 
2° 2 





1 
a fov = 5 fur ( ) where U=X+¥, vo xy 


b. fuv(u, v) = zii fier uo" (u/v)!?) where U = XY, V = X/Y 


f(t) = n(n − 1)λ[exp(−(n − 1)λt) − exp(−nλt)]
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69. nBvP-!a7? exp(—n(v/a)?) 71. 1— y^ 
75. Let U = Xi); y= X(j 
n! 
ov) = «Igi Da - pi 
x [FG]! f@LFO) — FY" f (v) — FQ) 
77. n(1 − x)^{n−1} 79. Exponential (λ)
81. a. n/(n + 1) b. (n − 1)/(n + 1)
Chapter 4 
3. E(X) = 3.1; Var(X) = 1.49
5. E(X) = α/3; Var(X) = 1/3 − α²/9
7. a. E(X) = 5/8
b. p_Y(0) = 1/2, p_Y(1) = 3/8, p_Y(4) = 1/8, E(Y) = 7/8
c. E(X²) = 7/8 d. Var(X) = 31/64
9. That value of n such that s Y, p(k) > es p(k) and s » p(k) «c x p(k) 
k=n k=1 k=n+1 k=1 


55. 
57. 


. Cov(X, Z) = —og; Corr(X, Z) = — 


. It makes no difference. 


. a. E(X_(k)) = k/(n + 1)
b. Var(X_(k)) = k(n − k + 1)/[(n + 1)²(n + 2)]
. 1/(n+ 1) 21. 1/3 23. 2/2? (square), 1/2? (rectangle) 
. 2a (a + 1)/22 27. 1 31. no 
.r/p 37. p> (1/k)"* 
. a. 4606 b. 10,000 
. The expected number of occurrences is 4.62. Using Markov's inequality, the chance of 100 or more occurrences is less than 0.0462, so you should be surprised.


. Cov(N_i, N_j) = −n p_i p_j


Ox 


(of + o2)? 


. b. a = o?/(0} + ož) 


c. (X + Y)/2is better when 1/3 < 02/07 < 3. 


. π_i = n^{-1} for the optimal portfolio. If each individual return has standard deviation σ, the standard deviation of the return from this portfolio is σ/√n. If the entire investment is in one security, the standard deviation of the return is σ.


E(T) = n(n + 1)μ/2; Var(T) = n(n + 1)(2n + 1)σ²/6


σ_X²σ_Y² + μ_X²σ_Y² + μ_Y²σ_X²


61. a. Cov(X, Y) = 1/36; Corr(X, Y) = 1/2
b. E(X|Y) = Y/2, E(Y|X) = (X + 1)/2
c. If Z = E(X|Y), the density of Z is f_Z(z) = 8z, 0 ≤ z ≤ 1/2
If Z = E(Y|X), the density of Z is f_Z(z) = 8(1 − z), 1/2 ≤ z ≤ 1
d. Ŷ = 1/2 + X/2; the mean squared prediction error is 1/24
e. Ŷ = 1/2 + X/2; the mean squared prediction error is 1/24


63. a. Cov(X, Y) = −.0085; ρ_XY = −.1256
b. E(Y|X) = (6X² + 8X + 3)/[4(3X² + 3X + 1)]


65. In the claim that E(T |N =n) = nE(X) 67. 3/2, 1/6 
71. pyx |x) is hypergeometric. E(Y|X = x) = mx/n 
73. np(l + p) 75. a. 1/2); b. 5/12) 


77. E(X|Y) = Y/2, E(Y|X) 2 X 1 


79. M(H) = $ + $e + te” 81. M(t)=1— p+ pe! 
85. M(t) = pe^t/[1 − (1 − p)e^t]; E(X) = 1/p; Var(X) = (1 − p)/p²
87. Same p 93. Exponential 


99. b. E[g(X)] ≈ log μ − σ²/(2μ²); Var[g(X)] ≈ σ²/μ²
101. E(Y) ≈ √λ − 1/(8√λ); Var(Y) ≈ 1/4 103. .0628 mm





Chapter 5 
3. .0228 13. N(0, 150,000); most likely to be where he started 
15. p — .017 17. n — 96 
5 1 2 ftx) 2 
21. b. Var(/(f)) = dx — I (f) 
n|Ja g(x) 


29. Let Z_n = n(U_(n) − 1). Then P(Z_n ≤ z) → e^z, −∞ < z ≤ 0


Chapter 6 
3. c = .17 9. E(S²) = σ²; Var(S²) = 2σ⁴/(n − 1)


Chapter 7 


1. p(1.5) = 1/5, p(2) = 1/10, p(2.5) = 1/10, p(3) = 1/5, p(4.5) = 1/10,
p(5) = 1/5, p(6) = 1/10; E(X̄) = 17/5; Var(X̄) = 2.34


3. d,f, h 7. n = 319, ignoring the fpc 
9. SE = .026. CI: (.05, .15) 11. a. 6 samples. b. Yes 


15. 


17. 
21. 


29. 


31. 
33. 
35. 


37. 


39. 


41 


43. 


47. 
49. 


53. 










b n Ay A^» 

20 211.6 86.8 

40 145.6 59.7 

80 96.9 39.8 
no 19. 1.28, 1.645 
The sample size should be multiplied by 4. 


A. R-t(ü- p) M i 
a. Q = — — — ——-, where f = probability of answering yes to unrelated 


question 
c. Var(Q) = r(1 — r)/(np?), where r = P(yes) = qp -t(1— p) 


n = 395 

The sample size for each survey should be 1250. 
a. X — 98.04 

NE = 15964 E (1- =) =5.28 


N 
c. 98.04 + 4.50 and 196,080 + 9008 


b. s? 

















aa+fp=1 
2 a. 
b. a = Tu B-— 
"X|1? X; OX, +o X; 
N—k)(N—k-—1)---(N- k—1 
Choose n such that p — 1 ( X E da 23 which 


N(N = -N — kc 1) 
can be done by a recursive computation; n — 581 
2 


N 2 
. b. — (a4 +03 - 2p0405) 


o5 
20 AOB i 
d. The ratio estimate is biased. The approximate variance of the ratio estimate 


c. The proposed method has smaller variance if p > 





is greater if ae > 1. 


HB 
R= 5 = .73, sr = .02, 73 + .04 
The bias is .96 for n = 64 and .39 for n = 128. 
Ignoring the fpc, 
a. R = 31.25; b. sr = .835; 31.25 + 1.637; 





c. T = 107; 107 + 5,228,153; d. sr, = 266,400, which is much better. 





a. For optimal allocation, the sample sizes are 10, 18, 17, 19, 12, 9, 15. For 
proportional allocation they are 20, 23, 19, 17, 8, 6, 7. 


b. Var(X̄_so) = 2.90, Var(X̄_sp) = 3.4, Var(X̄_srs) = 6.2




55. a. Xy + `X, 
b. 0.68 
c. No, the standard error would be 0.87. 
d. No, the standard error would be 0.71. 


57. p(2.2) = 1/6, p(2.8) = 1/3, p(3.8) = 1/6, p(4.4) = 1/3; E(X̄_s) = 3.4;
Var(X̄_s) = .72


61. a. w1 + w2 + w3 = 0 and w1 + 2w2 + 3w3 = 1
b. w1 = −1/2, w2 = 0, w3 = 1/2





Chapter 8 
3. For concentration (1), 
a. λ̂ = .6825; b. .6825 ± .081;
c. There are not gross differences between observed and expected counts.
5. a. θ̂ = 1/3 b. Lik(θ) = θ(1 − θ)²
c. θ̂ = 1/3 d. β(2, 3)
7. a. p̂ = 1/X̄ b. p̂ = 1/X̄
c. Var(p̂) ≈ p²(1 − p)/n
d. The posterior distribution is β(2, k); the posterior mean is 2/(k + 2).


13. P(Jà| > .5) e .1489 
17. b. & = n(8X2.,X2 — 2n)! — 1/2 


Tl’ (2a) aic 1 Y logiX (d — X;)] 20 
i=l 





© Too Tia) 2n 








( [Pore —I'(ay  2I"(o)TQo)— ree) 
d. | 2n 
T (o)? T (2)? 
19. a. 6 = Vn E (X; — u) b. f@=X C. no 
21. a. X̄ − 1 b. min(X1, X2, ..., Xn) c. min(X1, X2, ..., Xn)


23. Method of moments estimate is 1775. MLE is 888. 


27. Let T be the time of the first failure. 


5t 
a ep (7) b. ĉ=5T 
T 
(=) 
c. ĉ ~ exp ( — d. oz =T 
T 
31. a. 3p(1 — p)é b. 21/7 


33. Let q be the .95 quantile of the t distribution with n − 1 df; c = q s_X̄.


41. For α the relative efficiency is approximately .444; for λ it is approximately .823.


47. 


49. 


53. 


55. 


57. 


59. 
63. 




a. θ̂ = X̄/(X̄ − x0)
b. θ̂ = n/(Σ log X_i − n log x0)
c. Var(θ̂) ≈ θ²/n


a. Let p̂ be the proportion of the n events that go forward. Then α̂ = 4p̂ − 2.
b. Var(α̂) = (2 − α)(2 + α)/n


a. θ̂ = 2X̄; E(θ̂) = θ; Var(θ̂) = θ²/3n

b. θ̂ = max(X1, X2, ..., Xn)

c. E(θ̂) = nθ/(n + 1); bias = −θ/(n + 1); Var(θ̂) = nθ²/[(n + 2)(n + 1)²];
MSE = 2θ²/[(n + 1)(n + 2)]

d. θ* = (n + 1)θ̂/n

a. Let n1, n2, n3, n4 denote the counts. The mle of θ is the positive root of the
equation

(n1 + n2 + n3 + n4)θ² − (n1 − 2n2 − 2n3 − n4)θ − 2n4 = 0

The asymptotic variance is Var(θ̂) = 2(2 + θ)(1 − θ)θ/[(n1 + n2 + n3 + n4)(1 + 2θ)]. For these data, θ̂ = .0357 and s_θ̂ = .0057.

b. An approximate 95% confidence interval is .0357 ± .0112.








a. s² is unbiased. b. σ̂² has smaller MSE. c. ρ = 1/(n + 1)
b. à = (ni4- n5 — n3)/(ni4- na 4- n3) if this quantity is positive and 0 otherwise. 


In case (1) the posterior is β(4, 98) and the posterior mean is 0.039. In case (2)
the posterior is β(3.5, 102) and the posterior mean is 0.033. The posterior for
case (2) rises more steeply and falls off more rapidly than that of case (1).


65. mo = 16.25, & = 80 71. [[ d+ Xj) 
i=l 
73. 3x7 
i=l 

Chapter 9 

1. a. a = .002 b. power — .349 

3. a. a = .046 5. F, F, F, F, F, F, F, T 

7. Reject when ` X; > c. Since under Ho, 5; X; follows a Poisson distribution 


17. 


19. 


with parameter nλ, c can be chosen so that $P(\sum X_i > c \mid H_0) = \alpha$.


. For α = .10, the test rejects for X > 2.56, and the power is .2981. For α = .01,
the test rejects for X > 4.66, and the power is .0571.


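a. LR $= \frac{\sigma_1}{\sigma_0} \exp\left[\frac{1}{2} x^2 \left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_0^2}\right)\right]$. A level α test rejects for $X^2 > \sigma_0^2 \chi_1^2(\alpha)$.

b. Reject for $\sum_{i=1}^n X_i^2 > \sigma_0^2 \chi_n^2(\alpha)$ c. Yes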

a. X < 2/3 b. Reject for large values of X

c. Reject for $X > \sqrt{1 - \alpha}$ d. $1 - (1 - \alpha)^{3/2}$




21. 


23. 
29. 


33. 


35. X? 
37. X? 


39. 


41. 


43. 


45. 


51. 
57. 


59. 


a. Reject for X > 1; power = 1/2 

b. Significance level = α, power = 1 − α/2

c. Significance level = α, power = 1 − α/2

d. Reject when (1 − α)/2 < X < (1 + α)/2

e. For α > 0, the rejection region is not uniquely determined.

f. The rejection region is not uniquely determined. 

yes 25. $-2\log\Lambda = 54.6$. Strongly rejects 27. >12.02

yes 31. 2.6 x 107!, 9.8 x 102, 3 x 107^, 7 x 1077 

$-2\log\Lambda$ and $X^2$ are both approximately 2.93. .05 < p < .10; not significant


for Chinese and Japanese; both ~ .3. 
= .0067 with 1 df and p ~ .90. The model fits well. 


= 79 with 11 df and p ~ 0. The accidents are not uniformly distributed, 
apparently varying seasonally with the greatest number in November-January 
and the fewest in March-June. There is also an increased incidence in the summer 
months, July—August. 


$X^2 = 85.5$ with 9 df, and thus provides overwhelming evidence against the null
hypothesis of constant rate. 


Let $\hat p_i = X_i/n_i$ and $\hat p = \sum X_i / \sum n_i$. Then
: p) >=») 
ITAP — pom 


A= 





and 


— nj py 
E 


is approximately distributed as x? , under Hp. 


a. 9207 heads out of 17950 tosses is not consistent with the null hypothesis
of 17950 independent Bernoulli trials with probability .5 of heads ($X^2 = 11.99$
with 1 df).

b. The data are not consistent with the model ($X^2 = 21.57$ with 5 df, p ≈ .001).

c. A chi-square test gives $X^2 = 8.74$ with 4 df and p ≈ .07. Again, the model
looks doubtful.
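The statistic in part (a) can be reproduced from the two counts alone. The sketch below applies Pearson's chi-square formula to heads and tails with expected counts of half the tosses each; the function name is illustrative.

```python
def pearson_chi_square(observed, expected):
    """Pearson's X^2 = sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


heads, tosses = 9207, 17950
observed = [heads, tosses - heads]
expected = [tosses / 2, tosses / 2]  # fair-coin null hypothesis
print(round(pearson_chi_square(observed, expected), 2))
```

Both cells deviate from 8975 by 232, so the statistic is 2 × 232² / 8975.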


The binomial model does not fit the data ($X^2 = 110.5$ with 11 df). Relative to
the binomial model, there are too many families with very small and very large 
numbers of boys. The model might fail because the probability of a male child 
differs from family to family. 


The horizontal bands are due to identical data values. 


The tails decrease less rapidly than do those of a normal probability distribution, 
causing the normal probability plot to deviate from a straight line at the ends by 
curving below the line on the left and above the line on the right. 


The rootogram shows no systematic deviation. 
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Chapter 10 


3. 
7. 


41. 


$q_{.25} \approx 63.4$; $q_{.5} \approx 63.6$; $q_{.75} \approx 63.8$


Differences are about 50 days for the weakest, 150 days for the median. Can't 
tell for the strongest. 


F(x) 


. Bias & x I FO which is large for large x. 
. $h(t) = \alpha\beta t^{\beta - 1}$
. The uniform distribution on [0, 1] is an example. 
. $h(t) = (24 - t)^{-1}$. It increases from 0 to 24. It is more likely after 5 hours.
$(n + 1)\left(\frac{k + 1}{n + 1} - p\right) X_{(k)} + (n + 1)\left(p - \frac{k}{n + 1}\right) X_{(k+1)}$
. b. ~ .018 c. & 18 d. ~ 2.4 x 107? 
.a. n" 


b. x 1/3 5/3 2 7/3 8/3 3 10/35 11/3 
po) 1/27 3/27 3/27 3/27 8/27 3/27 3/21 3/27 


. The mean and standard deviation 


37. 


Median = 14.57, $\bar x$ = 14.58, $\bar x_{.10}$ = 14.59, $\bar x_{.20}$ = 14.59; s = .78, IQR/1.35 =
.74, MAD/.65 = .82


The interval $(X_{(r)}, X_{(s)})$ covers $x_p$ with probability $\sum_{k=r}^{s-1} \binom{n}{k} p^k (1 - p)^{n-k}$.


Chapter 11 


7. 


11. 


13. 
15. 


19. 


21. 


Throughout. For example, all are used in the assertion that $\mathrm{Var}(\bar X - \bar Y) =
\sigma^2 (n^{-1} + m^{-1})$. All are used in Theorem A and Corollary A. Independence
is used in the expression for the likelihood.


Use the test statistic

$t = \frac{\bar X - \bar Y - \Delta}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}$


The power of the sign test is .35, and the power of the normal theory test is .46. 


n = 768 


a. /2 b. Y — X > 2.33 c. 0.17 
d. Yes e. Y — X > 2.78; power = 0.11 
a. A pooled t test gives a p-value of .053.


b. The p-value from the Mann-Whitney test is .064. 




25. 


31. 


33. 


37. 


45. 


47. 


C. 


The sample sizes are small and normal probability plots suggest skewness, so 
the Mann-Whitney test is more appropriate. 


No b. No C. Yes, yes 


w      0      1      2      3     4     5     6     7     8      9      10
p(w)   .0625  .0625  .0625  .125  .125  .125  .125  .125  .0625  .0625  .0625
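The probabilities in this table agree with the null distribution of a signed rank statistic for n = 4, in which each of the 2⁴ sign patterns is equally likely. Treating the table that way is an assumption, since the problem statement is not reproduced here. The sketch below enumerates the distribution; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def signed_rank_null(n):
    """Null distribution of W = sum of the ranks 1..n that carry a positive sign."""
    counts = Counter()
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        counts[w] += 1
    total = 2 ** n
    return {w: Fraction(c, total) for w, c in sorted(counts.items())}


for w, p in signed_rank_null(4).items():
    print(w, float(p))
```

The distribution is symmetric about n(n + 1)/4 = 5, as the table shows.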


$E(\hat\pi) = 1/2$; $\mathrm{Var}(\hat\pi) = \frac{1}{12} \cdot \frac{m + n + 1}{mn}$, which is smallest when m = n.


Let $\theta = \sigma_X^2/\sigma_Y^2$ and $\hat\theta = s_X^2/s_Y^2$.


a. 


b. 


a. 


a. 


a. 


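For $H_1$: θ > 1, reject if $\hat\theta > F_{n-1,m-1}(\alpha)$. For $H_1$: θ ≠ 1, reject if $\hat\theta >
F_{n-1,m-1}(\alpha/2)$ or $\hat\theta < F_{n-1,m-1}(1 - \alpha/2)$.
A 100(1 − α)% confidence interval for θ is

$\left(\frac{\hat\theta}{F_{n-1,m-1}(\alpha/2)}, \frac{\hat\theta}{F_{n-1,m-1}(1 - \alpha/2)}\right)$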





. $\hat\theta$ = .60. The p-value for a two-sided test is .42. A 95% confidence interval
for θ is (.13, 2.16).


For each patient, compute a difference score (after − before), and compare
the difference scores of the treatment and control by a signed rank test or a
paired t test. A signed rank test gives for Ward A $W_+ = 36$, p = .124 and
for Ward B $W_+ = 22$, p = .205.


. To compare the two wards, use a two-sample t test or a Mann-Whitney test
on difference scores. Using a Mann-Whitney test, there is strong evidence
that the stelazine group in Ward A improved more than the stelazine group
in Ward B (p = .02) and weaker evidence that the placebo group improved
more in Ward A than in Ward B (p = .09).


For example, for 1957 by a Wilcoxon signed rank test there is no evidence
that seeding is effective (p = .73). For this and other years, it appears that the
gain in seeding over not seeding may be greatest when rainfall in the unseeded
area is low.


. Randomization guards against possibly confounding the effect of seeding with
cyclical weather patterns. Pairing is effective if rainfall on successive days is
positively correlated; in these data, the correlation is weak.


To test for an effect of seeding, compare the differences (target — control) to 
each other by a two-sample t test or a Mann-Whitney test. A Mann-Whitney 
test gives a p-value of .73. 


. The square root transformation makes the distribution of the data less skewed.
. Using a control area is effective if the correlation between the target and
control areas is large enough that the standard deviation of the difference
(target − control) is smaller than the standard deviation of the target rainfalls.
This was indeed the case.




49. 95% CI: (8.9, 13.1). Null hypothesis is overwhelmingly rejected. 


51. 


53. 


The durations of the bottle-fed are typically much longer. Because the distribution
is very skewed with some large outliers, a nonparametric test is preferable. The 
p-value from a signed-rank test is 0.012. 


The lettuce leaf cigarettes were controls to ensure that the effects of the experi- 
ment were due to tobacco specifically, not just due to smoking a lit cigarette. The 
unlit cigarettes were controls to ensure that the effects were due to lit tobacco, 
not just unlit tobacco. 











Chapter 12 
11. 
        A   B   C   D
I       2   3   4   5
II      3   4   5   6
III     4   5   6   7
17. à; 2 Y; — Y 
à; 2 Yi -Y -Yj +Y... 
BY. 
19. $Y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_{ij} + \nu_{ik} + \rho_{jk} + \phi_{ijk} + \varepsilon_{ijkl}$


21. 


23. 


The main effects $\alpha_i$, $\beta_j$, $\gamma_k$ satisfy constraints of the form $\sum_i \alpha_i = 0$. The two-
factor interactions, δ, ν, and ρ, satisfy constraints of the form $\sum_i \delta_{ij} = \sum_j \delta_{ij} = 0$.
The three-factor interactions, $\phi_{ijk}$, sum to zero over each subscript.


A graphical display suggests that Group IV may have a higher infestation rate 
than the other groups, but the F test only gives a p-value of .12 ($F_{3,16}$ = 2.27).
The Kruskal—Wallis test results in K = 6.2 with a p-value of .10 (3 df). 


For the male rats, both dose and light are significant (LH increases with dose and 
is higher in normal light), and there is an indication of interaction (p = .07) (the 
difference in LH production between normal and constant light increases with 
dose), summarized in the following anova table: 





Source df SS MS 
Dose 4 545549 136387 
Light 1 242189 242189 
Interaction 4 55099 13775 
Error 50 301055 6021 





The variability is not constant from cell to cell but is proportional to the mean. 
When the data are analyzed on a log scale, the cell variability is stabilized and 
the interaction disappears. The effects of light and dose are still clear. 
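The F ratios behind the statements about dose, light, and their interaction can be recovered from the anova table by dividing each mean square by the error mean square. The sketch below computes only the ratios. Turning them into p-values would need an F distribution function, which is left out to keep the example free of third-party libraries. The names are illustrative.

```python
def f_ratios(effects, error_df, error_ss):
    """F ratio for each effect: (SS / df) divided by the error mean square."""
    error_ms = error_ss / error_df
    return {name: (ss / df) / error_ms for name, (df, ss) in effects.items()}


# (df, SS) pairs copied from the anova table for the male rats.
effects = {
    "Dose": (4, 545549),
    "Light": (1, 242189),
    "Interaction": (4, 55099),
}
for name, f in f_ratios(effects, error_df=50, error_ss=301055).items():
    print(f"{name}: F = {f:.2f}")
```

The interaction ratio is referred to an F distribution with 4 and 50 degrees of freedom, which is where the quoted p = .07 comes from.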




25. The following anova table shows that none of the main effects or interactions are
significant:

Source df SS MS 
Position 9 83.84 9.32 
Bar 4 46.04 11.51 
Interaction 36 334.36 9.29 
Error 50 448.00 8.96 





There are some odd things about the data. The first reading is almost always larger 
than the second, suggesting that the measurement procedure changed somehow 
between the first and second measurements. One notable exception to this is 
position 7 on bar 50, which looks anomalous. 





27. Source df SS MS 
Species 2 836131 418066 
Error 131 446758 3410 
Total 133 1282889 





The variance increases with the mean and is stabilized by a square root trans- 
formation. The Bonferroni method shows that there are significant differences 
between all the species. 





29. Source Df SumSq MeanSq Fvalue p-value 
Furnace 2 4.1089 2.0544 1.4460 0.26159 
Wafer.Type 2 5.8756 2.9378 2.0678 0.15547 
Furnace × Wafer.Type 4 21.3489 5.3372 3.7566 0.02162
Residuals 18 25.5733 1.4207 





Only interactions are significant. Lines are not parallel in the interaction plot, in 
which the relationship of thickness of external wafers to furnaces appears quite 
different than that of the other two wafer types. 


33. a. N/R50 and R/R50 b. N/R50 and lopro c. N/R50 and N/R40 


Chapter 13 
1. $X^2 = 5.10$ with 1 df; p < .025
3. For the ABO group there is a significant association ($X^2 = 15.37$ with 6 df,
p = .02), due largely to the higher than average incidence of moderate-advanced
TB in B. For the MN group there is no significant association ($X^2 = 4.73$ with
4 df, p = .32).


5. $X^2 = 6.03$ with 7 df and p = .54, so there is no convincing evidence of a
relationship.


11 


13. 


15. 


17. 


19. 
23. 




. $X^2 = 12.18$ with 6 df and p = .06. It appears that psychology majors do a bit
worse and biology majors a bit better than average.


. In this aspect of her style, Jane Austen was not consistent. Sense and Sensibility
and Emma do not differ significantly from each other ($X^2 = 6.17$ with 5 df
and p = .30), but Sanditon I differs from them, "and" not being followed by "I"
less frequently and "the" not being preceded by "on" more frequently ($X^2 = 23.29$
with 10 df and p = .01). Sanditon I and II were not consistent ($X^2 = 17.77$,
df = 5, p < .01), largely due to the different incidences of "and" followed
by "I".


. a. In both cases the statistic is

$-2\log\Lambda = 2\sum_i \sum_j O_{ij} \log(O_{ij}/E_{ij})$

b. $-2\log\Lambda = 12.59$
c. $-2\log\Lambda = 16.52$
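The likelihood ratio statistic in part (a) is simple to compute for any two-way table of counts. The sketch below takes the expected counts from the product of the row and column totals, which is the independence (or homogeneity) fit. The example tables are hypothetical and are not the data behind the values 12.59 and 16.52.

```python
import math


def g_statistic(table):
    """-2 log Lambda = 2 * sum O_ij * log(O_ij / E_ij) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # a cell with O = 0 contributes nothing to the sum
            expected = row_totals[i] * col_totals[j] / n
            total += observed * math.log(observed / expected)
    return 2.0 * total


print(g_statistic([[10, 20], [20, 40]]))  # rows proportional, so the statistic is 0
print(g_statistic([[30, 10], [10, 30]]))  # hypothetical table with clear association
```

Under the null hypothesis the statistic is referred to a chi-square distribution with (rows − 1)(columns − 1) degrees of freedom, the same reference distribution as Pearson's X².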


Arrange a table with the number of children of an older sister as rows and the 
number of children of her younger sister as columns. 


a. $H_0$: $\pi_{ij} = \pi_{i\cdot}\pi_{\cdot j}$. This is the usual test for independence, with

$X^2 = \sum_{i,j} \frac{(n_{ij} - n_{i\cdot} n_{\cdot j}/n_{\cdot\cdot})^2}{n_{i\cdot} n_{\cdot j}/n_{\cdot\cdot}}$
b. Ho: 3° mi; = >> 7j; is equivalent to Ho: 7;; = 7 ji. The test statistic is 
izj jzi 


X = 5 (nij — (nij + n5)/2) / (Qni; + nj)/2 
izj 
which follows a xs distribution under Ho. The null hypotheses of (a) and (b) 
are not equivalent. For example, if the younger sister had exactly the same 
number of children as the older, (a) would be false and (b) would be true. 


For males, $X^2 = 13.39$, df = 4, p = .01. For females, $X^2 = 4.47$, df = 4,
p = .35. We would conclude that for males the incidence was especially high in
Control I and especially low in Medium Dose and that there was no evidence of
a difference in incidence rates among females.


There is clear evidence of different rates of ulcers for A and O in both London
and Manchester ($X^2 = 43.4$ and 5.52 with 1 df respectively). Comparing London A
to Manchester A, we see that the incidence rate is higher in Manchester
($X^2 = 91$, df = 1), whereas the incidence rate is higher for London O than for
Manchester O ($X^2 = 204$, df = 1).


p=.01 21. A=3.77 


McNemar's test gives a chi-square statistic equal to 28.5. Comparing this to the 
chi-square distribution with 1 df, the result is highly significant: heavy exertion 
is associated with myocardial infarction. This design is similar to the cell phone 
study in that each subject acts as his own control. 
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25. 


27. 


29. 


a. The total incidence of myocardial infarction (MCI) is reduced by aspirin
($X^2 = 26.4$ with 1 df). The odds ratio is 0.58, which is a considerable
reduction in risk due to aspirin. The incidences of fatal and nonfatal are both
significantly reduced as well ($X^2$ = 6.2, 20.43, df = 1). There is no indication
that among those having MCI, the fatality rate was reduced (p-value = 0.32).
The difference in the incidence of strokes was not statistically significant,
$X^2 = 1.67$, df = 1.

b. There is no evidence that total cardiovascular mortality is decreased by aspirin, 
but the reduction in mortality due to myocardial infarction is significant. 


The death penalty was given in 13% of the cases in which the victim was white
and the defendant was not. In all other cases the death penalty was given only
5–6% of the time. A chi-square test of independence yields a statistic equal to
15.9 with 3 df, so the p-value is 0.001. Whether such a test is valid is debatable.
The use of the test could be criticized on the grounds that these are all the data
there are for the years 1993–97, the numbers speak for themselves, and there
is no plausible probability model on which to base probability calculations, like
p-values. The use of the test could be defended by arguing that for a table with
these row and column marginal totals, it would be very unlikely that there would
be such variation of the proportions between rows if only chance were at work.


It depends on how the sampling is done. If the numbers of males and females
are determined prior to the sample being drawn, a test of homogeneity would be
appropriate. If only the total sample size is fixed, a test of independence would
be appropriate. Management won't care, because the qualitative nature of the
conclusion would be the same in either case.


Chapter 14 


1. 


5. 


13. 


b. log y = loga — bx. Let u = log y and v = logx. 


d. $y^{-1} = ax^{-1} + b$. Let $u = y^{-1}$ and $v = x^{-1}$.


This can be set up as a least squares problem with the parameter vector $\beta =
(p_1, p_2, p_3)^T$ and the design matrix


Orr OCOCre 
er Or OC oO 


The least squares estimate is $\hat\beta = (X^T X)^{-1} X^T Y$. This gives, for example,





i= ys ee ee yy 
pra^ 4" 4? 4 a 


a. $\mathrm{Var}(\hat\mu_0) = \sigma^2 \left[\frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right]$





15. 
21. 
23. 
31. 
37. 


39. 
41. 
43. 







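c. $\hat\mu_0 \pm s_{\hat\mu_0} t_{n-2}(\alpha/2)$, where

$s_{\hat\mu_0} = s \left[\frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right]^{1/2}$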
$\hat\beta = (\sum x_i y_i)/(\sum x_i^2)$
Place half the $x_i$ at −1 and half at +1.
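The recommendation to put the design points at the two ends of the interval can be illustrated numerically. The sketch below assumes the goal is to minimize the variance of the least squares slope, σ² / Σ(x_i − x̄)², with the x values restricted to [−1, 1]. That reading of the problem is an assumption, and the function name is illustrative.

```python
from fractions import Fraction


def slope_variance_factor(xs):
    """Var(slope) / sigma^2 = 1 / sum((x_i - xbar)^2) for simple linear regression."""
    n = len(xs)
    xbar = sum(xs, Fraction(0)) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return 1 / sxx


endpoints = [Fraction(-1), Fraction(-1), Fraction(1), Fraction(1)]
evenly_spaced = [Fraction(-1), Fraction(-1, 3), Fraction(1, 3), Fraction(1)]
print(slope_variance_factor(endpoints), slope_variance_factor(evenly_spaced))
```

Spreading the points evenly over the interval gives a larger slope variance than splitting them between the endpoints, although the endpoint design leaves no way to check the fit for curvature.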
a. 85 b. 80 25. true 
Cov(U, V) 0 


$\hat A$ = 18.18, $s_{\hat A}$ = .14; 18.18 ± .29
B = —2.126 x 10^, sg = 1.33 x 10; —2.126 x 104+ 2.72 x 10? 





Neither a linear nor a quadratic function fits the data. 


One possibility is DBH versus the square root of age. 




A physical argument suggests that settling times should be inversely proportional
to the squared diameter; empirically, such a fit looks reasonable. Using the model
$T = \beta_0 + \beta_1/D^2$ and weighted least squares, we find (standard errors listed in
parentheses)





10 25 28 
$\hat\beta_0$ −403 (1.59) 1.48 (2.50) 2.25 (2.08)
$\hat\beta_1$ 28672 (371) 18152 (573) 16919 (474)





From the table we see that the intercept can be taken to be 0. 


. For 1998, RSS = .016. For the 1999 predictions, RSS = .055, which is much
larger. The predicted values for 1999 appear unrelated to the observed values.
The poor performance in 1999 of the predictions formed from the 1998 data is
due to over-fitting: 4 parameters were estimated from 5 data points.


. a. There appear to be two regimes corresponding to durations less than or greater 
than 3 min, and it is best to fit separate linear regressions to each regime. 

b. For a duration of 2 min the prediction would be 54.3 min. The standard error 
of this fitted value is 1.04 min. But there are two parts to the prediction error: 
the error of the fitted value and the variability of a new observation around 
its expected value. This latter is measured by the standard deviation of the 
residuals, 5.9 min. For a duration of 4.5 min, the prediction is 80.3 min. The 
standard error of this prediction is 1.09 min and the residual standard deviation 
is 6.7 min. A 95% prediction interval is (67 min, 94 min). See problems 13 
and 14. 
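The two sources of prediction error described in part (b) combine by adding variances. The sketch below reproduces the quoted interval for a duration of 4.5 min, assuming a normal multiplier of 1.96. The text does not say which quantile was used, and a t quantile with the appropriate degrees of freedom would be slightly larger.

```python
import math


def prediction_interval(fit, se_fit, resid_sd, z=1.96):
    """Approximate 95% prediction interval for a new observation.

    Combines the standard error of the fitted value with the residual
    standard deviation; z = 1.96 is an assumed normal multiplier.
    """
    se_pred = math.sqrt(se_fit ** 2 + resid_sd ** 2)
    return fit - z * se_pred, fit + z * se_pred


lower, upper = prediction_interval(fit=80.3, se_fit=1.09, resid_sd=6.7)
print(round(lower), round(upper))
```

The residual standard deviation dominates: the standard error of the fitted value adds very little to the width of the interval.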
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moment-generating functions and, 181 double exponential, 111 
random variable sequence and, 178 gamma, 53-54 
Convolution prior, 94 
of functions, 97 Rayleigh, 100 
of sequences, 96 Density curves, 389-391 
Copula Density function 
explanation of, 78-79 Cauchy distribution and, 413 
Farlie-Morgenstern, 83-84 explanation of, 47, 75-76 
Correlation generating random variables from, 92-04 
expected values and, 138-146 of normal distribution, 54—55 
least squares method and, 560—563 Dependent variables, 542 
scatterplots and, 146 Discrete random variables 
serial, 603 Bernoulli random variables and, 37-38 
Correlation coefficient binomial distribution and, 38—40 
explanation of, 142, 146 explanation of, 35-37, 84 
Pearson, 406 geometric and negative binomial distributions 
population, 221—222 and, 40-41 
squared multiple, 583 hypergeometric distribution and, 42 
Counting methods, 6-7 joint distributions and, 72-75 
Covariance Poisson distribution and, 42-47 
bilinear property of, 139-140 Disjoint, 3 
explanation of, 138-139 Dispersion, 401—402 
of least squares estimates, 573-574 Distributive laws, 4 
of linear combinations of random variables, Double exponential density, 111 
139, 140 Double exponential distribution, 323—324 


population, 221 
Covariance matrix, 568 


Cramér-Rao inequality, 300—302 Efficiency 
Critical values asymptotic relative, 299-300 
of smaller rank sum for the Wilcoxon Cramér-Rao lower bound and, 298-302 
Mann-Whitney test, A21—A23 Empirical cumulative distribution function, 
of W,.(n) for Wilcoxon signed-ranks 378-380 
test, A24 Empty set, 3 
Cross-covariance matrix, 572 Equilibrium, Hardy-Weinberg, 273—274, 283-285, 
Cumulative distribution function 293-294 
bivariate, 77 Estimated standard error, 213, 214, 219, 262 
Cauchy, 67 Estimation 
empirical, 378-380 likelihood and, 13, 361 
explanation of, 36, 48 optimal, 260 
joint behavior of random variables and, parameter, 257—260 
71-72 of population parameters, 206—214 
Cumulative normal distribution table, A7 of ratio, 220—227 
of σ², 575-576
Events 
Data summarization complement of, 3 
boxplots and, 402—404 explanation of, 2-3 
cumulative distribution function and, independent, 23-26 
378-388 mutually independent, 24 
histograms, density curves, and stem-and-leaf Expectation 
plots and, 389-392 of continuous random variables, 118—121 
measures of dispersion and, 401—402 of functions of random variables, 121—124 
measures of location and, 392-401 of linear combinations of random variables, 
overview of, 377—378, 407—408 124-130 


scatterplots and, 404—407 of sample mean, 203-210 


Expected values 
approximate methods and, 161—166 
conditional expectation and prediction and, 
147-154 
covariance and correlation and, 138-146 
moment-generating function and, 155-161 
of random variable, 116-130 
variance and standard deviation and, 
130-137 
Experimental design. See also One-way layout; 
Two-way layout; specific experimental 
designs 
confounding and, 457—458 
FD&C Red No. 40 and, 455-456 
fishing expeditions and, 458 
Lanarkshire milk experiment and, 
453-454 
mammary artery ligation and, 452-453 
observational studies and, 457 
placebo effect and, 453 
Portacaval shunt and, 454—455 
randomization and, 456 
Experiments, 2 
Exponential density, 50—52 
Exponential distribution 
explanation of, A2 
memoryless property of, 51, 384 
probability plot for, 374 
reliability studies and, 375 
use of, 50-51 
Exponential family of probability distributions, 
308-309 
Extended multiplication principle, 8, 9 


Factorial design, 504—505. See also Two-way 
layout 
Factorization theorem, 306-310 
False negative, 17 
False positive, 17 
Farlie-Morgenstern family, 77—78, 83-84, 86 
F distribution, 194 
Fisher's exact test, 514—516 
Fractional factorial design, 505 
Frequency approach, 26 
Frequency function 
conditional, 87-88 
explanation of, 36 
marginal, 73 
Friedman's test, 503—504 
F tests 
applications of, 485 
examples of, 483—485 
explanation of, 478—483 
likelihood ratio test and, 483 
two-way layout and, 495 
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Gamma density 
Bayesian calculations and, 288 
explanation of, 53-54, 118-119 
prior and, 294 
Gamma distribution 
explanation of, 258-259, A2 
maximum likelihood and, 270-271 
method of moments and, 263-264 
moment-generating function of, 157, 159 
probability plots and, 357-358 
standardized, 186 
Gaussian distribution. See Normal distribution 
Generalized likelihood ratio tests, 339—341 
Geometric distribution, 40, A1 
Geometric random variables, 117 
Gibbs sampling, 297 
Goodness of fit. See also specific tests 
chi-square statistic and, 342-343 
coefficient of skewness and, 359 
examination of, 376 
examples of, 257, 517 
explanation of, 334, 361 
hanging rootograms and, 349-352
likelihood ratio tests and, 341—347 
for normality, 358-361 
probability plots and, 352-358 
visual method to assess, 372—373 


Hanging chi-gram, 352 
Hanging rootograms, 349-352 
Hardy-Weinberg equilibrium 
explanation of, 273—274, 283-285, 
293—294 
likelihood ratio tests and, 343—344 
Hazard function 
explanation of, 383-384 
use of, 412 
Heteroscedastic error, 554 
High posterior density (HPD) 
interval, 288 
Histograms 
explanation of, 389, 390 
hanging, 349—352 
multinomial distribution and, 74—75 
observed vs. fitted values in, 349
for population values, 200—201 
Hodges-Lehmann shift estimate, 468 
Homogeneity, chi-square test of, 516—520 
Homoscedastic error, 554 
Hypergeometric distribution, 42 
Hypotheses 
alternative, 334, 336, 424—425 
composite, 334 
null, 331, 332, 334 
simple, 334 
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Hypothesis tests 


choice of null hypothesis and, 335—336, 361 

comparing independent samples and, 424—425 

duality of confidence intervals and, 337—339, 
425-426 

evidence assessment and, 329—330 

likelihood ratio test and, 339—347 

Neyman-Pearson paradigm and, 331—337 

Poisson dispersion test and, 349 

specification of significance level and, 334—335 


Independence 


chi-square test of, 520-523 
explanation of, 23-26 


Independent continuous random variables, 


104-107 


Independent events, 23—26 
Independent random variables 


explanation of, 37, 84 
joint distributions and, 84-86 


Independent sample comparisons 


Bayesian approach and, 443—444 
explanation of, 421 
Mann-Whitney test and, 435—443 
normal distribution and, 421—433 
power and, 433-435 


Independent variables, 542 
Indicator random variables, 37—38 
Inequality 


Chebyshev's, 133, 134
Markov's, 121 


Inference 


about β, 577—580
Bayesian, 94—95, 297, 311 
linear least squares and, 585—587 


Integration, Monte Carlo, 179, 190 
Interactions, two-way layout and, 491—492, 500 
Intercept, 547—550 

Interquartile range (IQR), 402 

Intersection of events, 3 


Jackknife technique, 248 
Jacobian, 101—103 

Joint density, 79-81, 83-84 
Joint density function, 75-77 
Joint distribution function 


cumulative, 78-79 
explanation of, 72-73 


Joint distributions 


applications for, 71 

conditional distributions and, 87—96 
continuous random variables and, 75-84 
discrete random variables and, 72—75 
extrema and order statistics and, 104—107 


general case and, 99-104 
independent random variables and, 84—86 
overview of, 71—72 
sums and quotients and, 96-99 
Joint frequency distribution, 74 


Kernel probability density estimate, 390 
Kruskal-Wallis test, 488—489 
Kurtosis, coefficient of, 359 


Laplace's rule of succession, 326 
Large sample theory 
construction of confidence intervals and, 281 
for maximum likelihood estimates, 274—279 
Law of large numbers 
explanation of, 177-178 
illustrations of, 178-180 
Law of total expectation, 149 
Law of total probability 
application of, 91 
explanation of, 18-19 
Least squares estimates 
estimation of σ², 575-576
explanation of, 397 
inference about β, 577—580
mean and covariance of, 573-574 
residuals and standardized residuals and, 
576-577 
standard statistical model and, 548 
vector-valued random variables and, 567—572 
Likelihood, 13-14, 361 
Likelihood ratio 
evaluation of, 330 
explanation of, 329—330, 341—342, 361 
Likelihood ratio tests 
application of, 347 
construction of, 339-340 
generalized, 339—341 
for multinomial distribution, 341—347 
Limit theorems 
central, 181—188 
extreme values and, 191 
law of large numbers and, 177-180 
Linear combinations, of random variables, 
124-130 
Linearization, 162 
Linear least squares 
explanation of, 545-547 
inference and, 585—587 
local linear smoothing and, 587—591 
matrix approach to, 564—567 
multiple regression and, 580—585 
simple linear regression and, 547—563 
statistical properties of estimates, 567—580 


Linear regression 
assessing fit and, 550—560 
correlation and, 560—563 
estimated slope and intercept and, 547—550 
explanation of, 545 
multiple, 580—585 
LINPACK, 564 
Local linear smoothing, 587—591 
Location measures 
arithmetic mean and, 393-395 
bootstrap to estimate variability and, 399—401 
comparison of, 398 
explanation of, 392-393 
median and, 395-397 
M estimates and, 397—398 
trimmed mean and, 397 
Location parameter, 130 
Log likelihood, 268—270, 278, 426—427 
Log linear models, 530 
Lognormal density, 69 
Lognormal distribution, 187, 373 
Lower bound 
Cramér-Rao, 300-302 
efficient, 302 


Mann-Whitney test 
confidence intervals and, 442 
explanation of, 435-437, 439 
illustrations of, 437—443 
Marginal density 
explanation of, 76 
joint density and, 79-81, 83-84 
method for finding, 79-80, 91 
Marginal density function, 76—77 
Marginal frequency function, 73 
Markov's inequality, 121 
Matched-pairs designs, 523-526 
Matrix approach, to linear least squares, 
564—566 
Matrix of transition probabilities, 19 
Maximum likelihood estimates 
confidence intervals from, 279—285 
examples of finding, 268—272, 289 
explanation of, 13-14, 267-268 
large sample theory for, 274—279 
of multinomial cell probabilities, 272-274 
negative binomial distribution and, 303-304 
posterior distribution and, 296 
sufficiency and, 308, 309 
variance of, 302 
Maxwell’s distribution, 121 
McNemar's test, 525, 526 
Mean 
arithmetic, 393-395 
of least squares estimates, 573-574 
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of normal density, 55 
population, 201, 217 
posterior, 288 
sample, 195—198, 203-210, 220, 397 
testing of normal, 339-340 
trimmed, 397 
Mean squared error 
explanation of, 136 
predictor of, 154 
Mean vector, 568 
Measurement error 
central limit theorem and, 186-187 
model for, 135-137 
random vs. systematic, 135 
size of, 136-137 
Measurements 
dispersion, 401—402 
location, 392—401 
repeated, 179-180 
standards of, 135 
Measurement variability, 508 
Median, 395-398 
Median absolute deviation from the median 
(MAD), 402 
Memoryless, 51, 384 
M estimates, 398, 591 
Method of moments 
consistency and, 266-267 
examples of, 261—264 
explanation of, 260—261, 267 
sampling distributions and, 264—265 
Mode 
explanation of, 635 
posterior, 288 
Moment-generating function 
convergence and, 181 
explanation of, 155 
limitation of, 161 
properties of, 155-159, 195 
random sums and, 160-161 
Monte Carlo integration, 179, 190 
Multinomial coefficients, 15 
Multinomial distribution 
explanation of, 73-74 
likelihood ratio tests for, 341—347 
Multiple linear regression 
explanation of, 580-584 
use of, 584—585 
Multiplication law 
explanation of, 17 
use of, 17-18, 20 
Multiplication principle 
explanation of, 7-9, 11 
extended, 8, 9 
use of, 14 
Multiplicative treatment effects, 386, 387 
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Muon decay, 299-300 
Mutually independent events, 24 


Negative binomial distribution 
explanation of, 41, 302-305, A1 
use of, 303 
Neyman allocation, 233 
Neyman-Pearson paradigm 
explanation of, 331—333 
null hypothesis and, 335-336 
proof of, 332 
significance level and p-value and, 334—335 
uniformly most powerful tests and, 336 
use of, 333-334 
Nonparametric methods 
explanation of, 435 
Friedman's test and, 503—504 
Kruskal-Wallis test and, 488—489 
Normal approximation 
large sample, 296 
to sampling distribution, 214—220 
Normal density 
bivariate, 81—83, 91 
standard, 55 
Normal distribution 
bivariate, 145-146, 148—149 
comparing independent samples and, 
421-444 
comparing paired samples and, 446—447 
density function of, 54-55 
distributions derived from, 192-198 
expectation and, 119 
explanation of, 54—58, 257-258, A2 
finding probabilities from, 59-60 
maximum likelihood and, 269 
method of moments and, 263 
precision and, 290 
standard, 60, 61, 157-158 
testing goodness of fit to, 358-361 
variables and, 132 
Normal equations, 565—566 
Normal theory, for two-way layout, 492—499 
Null distribution 
analysis of variance and, 482, 483 
explanation of, 331, 333 
Kruskal-Wallis test and, 488 
Null hypothesis 
analysis of variance and, 481—483 
Bonferroni method and, 487 
explanation of, 331, 515 
Hardy-Weinberg equilibrium and, 343-344 
Kruskal-Wallis test and, 488 
Neyman-Pearson approach and, 334—336 
rejection of, 433, 458 
tests of, 315, 495 


Observational studies, 457, 458 
Odds ratio, 527—529 
One-sided alternative, 336, 425 
One-sided confidence intervals, 241—242 
One-way layout 
explanation of, 477-478 
F test and, 478—485 
Kruskal-Wallis test and, 488—489 
multiple comparisons and, 485—487 
random effects model for, 507—508 
Order statistics, 105—107, 352 
Outliers, 393-395


Paired sample comparison 
example of, 450—452 
explanation of, 444—446 
normal distribution and, 446—447 
signed rank test and, 448—450 
Pairwise independent, 24 
Parameter estimation 
approach to, 260 
Bayesian approach to, 285-298 
explanation of, 257-260 
Parametric bootstrap, 311—312. See also 
Bootstrap 
Pareto distribution, 323 
Pearson correlation coefficient, 406 
Percentage points of the Bonferroni t statistic 
table, A20 
Percentiles 
of F distribution, A10-A13
of studentized range, A14—A19
of t distribution, A9 
of χ² distribution, A8
Permutations, 9-11 
Poisson dispersion test 
application of, 320, 375 
explanation of, 348-349 
Poisson distribution 
approximation of, 181—185 
compound, 160-161 
explanation of, 42-47, 156-157, 302, A2 
fitting, 286-288 
fit to radioactive decay, 255—257 
maximum likelihood and, 268—269, 
282-283 
method of moments and, 261—263 
unbiased estimator and, 302 
uses for, 45 
Poisson frequency function 
explanation of, 42-44 
uses for, 44—45 
Poissonness plot, 372-373 
Poisson probabilities, 45 
Poisson process, 46-47 


Poisson random variables 
expected value of, 117 
standardized, 181 
sum of independent, 159 

Polar method, 101 

Political surveys, 238-239 

Pooled sample variance, 422 

Pooling, 128-129 


Population correlation coefficient, 221—222 


Population covariance, 221 
Population mean 
confidence interval for, 217 
explanation of, 201 
Population parameters 
estimation of, 206-214 
explanation of, 200-202 
Population standard deviation, 202 
Population total 
estimation of, 210 
explanation of, 201 
Population variance 
estimation of, 210—214, 222-223 
explanation of, 201 
Posterior distribution, 286, 296 
Posterior mean, 288 
Posterior mode, 288 
Power 
calculations of, 433—435 
of test, 331 
Precision, 290 
Prediction 
explanation of, 152-153 
implementation of, 153-154 
Predictor variable, 542 
Priors 
conjugate, 294 
explanation of, 94 
improper, 295—296 
Poisson parameter and, 294—296 
Probability 
applications for, 1 
Bayesian approach and, 26 
conditional, 16-23 
converge in, 178 
frequency approach and, 26 
independence and, 23-26 
law of total, 18-19, 91 
methods for computing, 6-15 
multiplication principle and, 7—9 
overview of, 1-2, 26 
permutations and combinations and, 9-15 
Poisson, 45 
sample spaces and, 2-4 
use of, 22, 23 
Probability-generating function, 174 
Probability integral transform, 353 
Probability mass function. See Frequency function 
Probability measures 
explanation of, 4 
properties of, 4-6 
Probability plots 
cautions regarding, 355 
examples of, 355—358, 502 
explanation of, 352-355 
exponential, 374 
linearity of, 360 
of residuals, 556—558 
Propagation of error (δ method) 
application of, 165-166 
explanation of, 162 
Proportional allocation, 234—237 
Pseudorandom numbers, 63, 70 
p-value, 335, 346, 452 


QR method, 593 
Quadratic form, 571 
Quantile-quantile (Q-Q) plots, 385-389, 411 
Quotients, of random variables, 96-99 


Random effects model, 507-508 
Randomization, 456 
Randomized block designs, 500—503, 505 
Randomized response, 243 
Random sampling. See also Sampling 
advantages of, 200 
simple, 202—220, 236 
stratified, 227—238 
Random sums 
expectation and, 150-152 
moment-generating functions applied to, 160-161 
Random variables 
Bernoulli, 37—38, 40, 126, 140, 305-307 
with binomial distribution, 40 
characteristic function of, 161 
continuous, 47—58, 75-84 
covariance of, 138—140 
cumulative distribution function of, 36 
discrete, 35-47, 72-75 
examples of, 35 
expectations of functions of, 121-124 
expectations of linear combinations of, 124-130 
expected value of, 116-121 
explanation of, 35, 64 
frequency function of, 36 
functions of, 58-63 
functions of jointly distributed, 96-104 
geometric, 117 
independent, 37, 84-86 
indicator, 37-38 
Poisson, 117 
standardizing, 182 
uniform, 47 
variance and standard deviation of, 130-135 
vector-valued, 567-572 
Rao-Blackwell theorem, 310 
Ratio estimate, 223-226 
Ratios. See also specific ratios 
estimation of, 220-227 
expectation and variance of, 165-166 
Rayleigh density, 100 
Rayleigh distribution, 321-322, 376 
Regression, 561. See also Linear regression 
Rejection method, 92-94 
Rejection region, 331 
Replacement 
sampling with, 9, 10, 207 
sampling without, 9, 10, 207-209 
Residuals, 576-577 
Residual sum of squares (RSS), 549 
Response variable, 542 
Retrospective case-control study, 536 
Robust measures, 395 
Roosevelt-Landon survey, 239 
Round-off error, 189 


Sabermetrics, 562-563 
Sample mean 
explanation of, 195-198, 397 
sampling distribution of, 220 
variance of, 207-209 
Sample moment, 260 
Sample spaces, 2-4 
Samples/sampling 
cluster, 238 
comparing paired, 444-452 
comparing two independent, 421-444 
estimation of ratio and, 220-227 
Gibbs, 297 
methods for, 199-200, 527-528 
model for, 238 
political surveys and, 238-239 
population parameters and, 200-202, 210 
problems associated with, 238-249 
prospective study method for, 527 
with replacement, 9, 10, 207 
retrospective study method for, 527-528 
simple random, 202-220 
stratified random, 227-238 
systematic, 238 
variance of, 195-198, 207-209, 229-231 
without replacement, 9, 10, 207-209 
Sample standard deviation, 401-402 
Sampling distribution 
explanation of, 203-205, 257, 260 
normal approximation to, 214-220 
of sample mean, 220 
Sampling fraction, 209 
Scale parameter, 53 
Scatterplots 
correlation and, 146 
relationships with, 404-407 
Selective reduction, 450-452 
Serial correlation, 603 
Set theory, 3-4 
Shape parameter, 53 
Significance level, 331 
Sign test, 365, 461 
Simple hypothesis, 332 
Simple random sampling 
estimation of population variance and, 210-214 
expectation and variance and, 203-210 
explanation of, 202 
normal approximation and, 214-220 
proportional allocation and, 236 
Simpson's paradox, 7 
Simulation 
explanation of, 203 
method of moments and, 264-266 
Skewness 
coefficient of, 359 
explanation of, 155 
Slope, 547-550 
Smoothing, local linear, 587-591 
Squared multiple correlation coefficient, 583 
St. Petersburg paradox, 118 
Standard deviation 
of normal density, 55 
population, 202 
of random variables, 130-135 
Standard error 
of the estimate, 267 
estimated, 213, 214, 219, 262 
explanation of, 207, 209, 214, 260 
Standardized residuals, 576-577 
Standard normal density, 55 
Standard statistical model 
assumptions of, 554 
explanation of, 547-549 
Stem-and-leaf plots 
example using, 394-395 
explanation of, 391-392 
Strata, 227 
Stratified estimates, 228-232 
Stratified random sampling 
allocation methods and, 232-238 
explanation of, 227-228 
properties of stratified estimates and, 228-232 
proportional allocation and, 234, 235 
Studentized range distribution, 485—486 
Subsets, 2-3 
Sufficiency 
explanation of, 305-306 
factorization theorem and, 306—310 
Rao-Blackwell theorem and, 310 
Sufficient statistic 
explanation of, 305 
one-dimensional, 308 
Rao-Blackwell theorem and, 310 
two-dimensional, 308 
Sums 
of random variables, 96-97 
of squares, 479—482 
Survival function 
empirical log, 384-385 
example of, 381—382 
explanation of, 380 
hazard function and, 383—384 
Systematic sampling, 238 


t distribution, 193-194, 198 
Test statistic 
calculation of, 436, 448 
explanation of, 331, 333, 425, 428 
Three-factor experiments, 504—505 
Tolerance interval, 106—107 
Transformations, variance-stabilizing, 351 
Trimmed mean, 397 
Tukey's method 
Bonferroni method vs., 487 
example of, 486-487 
explanation of, 485-486 
for multiple comparisons, 502 
Two-sided alternative, 336, 425 
Two-way layout 
additive parametrization and, 489-492 
explanation of, 489 
Friedman's test and, 503-504 
normal theory for, 492-499 
randomized block designs and, 500-503 
Type I error, 331 
Type II error, 331 


Unbiased estimates 
explanation of, 206, 211, 212, 262 
variance of, 302 
Uniform density, 47 
Uniform distribution, A2 
Uniformly most powerful tests, 336 
Uniform random variables, 47 
Union, 3 


Value at risk (VaR), 49, 58 
Variability 
batch-to-batch, 508 
estimation of, 399—401 
measurement, 508 
Variance. See also Analysis of variance 
approximate confidence intervals and, 231-232 
calculation of, 132-133 
population, 201, 210—214, 222—223 
of random variables, 130-135 
of ratio, 165-166 
sample, 195—198, 207—209, 229—231, 422 
of sample mean, 203-210 
Variance-stabilizing transformation, 351 
Variation, 432 


Weibull cumulative distribution function, 69, 317 
Wilcoxon rank sum test. See Mann-Whitney test 
Wilcoxon signed-ranks test 
critical values for, A24 
explanation of, 448-450 